A system for creating word models comprising
means for making an acoustic model from one or
more utterances of a word, means for enabling a user
to associate a sequence of textual characters with
that acoustic model, said means including means for
indicating to the user a menu of one or more
sequences of textual characters, means for enabling
the user to select a given character sequence from
the menu, means for enabling the user to edit the
selected character sequence to make it represent a
different sequence of characters, and means for
associating said edited character sequence with said
acoustic model.
Speech Recognition System.



This invention relates to methods and apparatus
for speech recognition and for training models
for use in speech recognition, and, in particular, to
such methods and apparatus which decrease the
time the user must spend training the recognition
system and which increase the recognition
performance.
According to one aspect of the present inven- 
tion a system for creating word models is provided, 
the system including means for making an acoustic 
model from one or more utterances of a word and 
means for enabling a user to associate a sequence 
of textual characters with that acoustic model, this 
latter means including means for indicating to the 
user a menu of character sequences; means for
enabling the user to select a given character
sequence from the menu; means for enabling the
user to edit the selected character sequence; and
means for associating the edited character se- 
quence with the acoustic model. 

Preferably the menu of character sequences is 
a list of best matching words produced by speech 
recognition of one of the training utterances, and the
user can select and edit the desired menu choice 
by voice. It is also preferred that the system is 
used as part of a speech recognition system which 
produces as its output for each spoken word either 
the word recognised or the word edited and se- 
lected for that spoken word. 

The invention will become more evident upon 
reading the following description of the preferred 
embodiments in conjunction with the accompany- 
ing drawings, in which: 

FIGURE 1 is a schematic block diagram of 
the functional steps of a discrete utterance speech 
recognition system which embodies the present 
invention; 

FIGURE 2 is a flow chart of the functional 
steps used in the acquisition of new acoustic to- 
kens for use in the speech recognition systems of 
Figures 1 and 25;

FIGURE 3 is a schematic block diagram of 
the recognition algorithm and related data struc- 
tures used in the embodiment of the invention 
shown in Figure 1; 

FIGURE 4 is a schematic block diagram of 
the functional steps performed in a well known 
discrete utterance recognition algorithm which uses 
hidden Markov models and which is used as the 
recognition algorithm in the embodiment of Figure 
1;

FIGURE 5 is a schematic block diagram of 
the functional steps involved in the training or 
building of acoustic models for a hidden Markov 
model which is used in the recognition systems of 



Figures 1 and 25; 

FIGURE 5A is a modification of the block 
diagram of Figure 5 for recognition of continuous 
speech using hidden Markov models; 
FIGURE 6 is a schematic block diagram of
the forward pass of the well known method of
dynamic programming applied as the first step of
training acoustic models or of continuous speech
recognition;

FIGURE 7 is a schematic block diagram of
the traceback step for training acoustic models or 
for continuous speech recognition; 

FIGURE 8 is a schematic diagram of acoustic
model training of the type described in Figures
5, 6, and 7, which also shows how data derived in
such training can be used to train phonemic word
models;

FIG. 9 is a flowchart of the probabilistic
natural language modeling process used in the
embodiments of FIGS. 1 and 25;

FIGS. 10-24 illustrate an example of the se- 
quence of computer displays which result when the 
program of FIG. 1 is used to enter the phrase "This 
invention relates to" into the system; 

FIG. 25 is a schematic block diagram of the
functional steps of a connected phrase speech 
recognition system which embodies the present 
invention; 

FIG. 26 is a continuation of the schematic
block diagram of FIG. 25, illustrating the steps of
the phrase edit mode of that system; 

FIGS. 27-36 illustrate an example of the se- 
quence of computer displays which result when the 
program of FIGS. 25 and 26 is used to enter
phrases into the system.

Figure 1 is a schematic block diagram of the
functional steps performed in a preferred embodiment
of the present invention designed to recognize
separately spoken words. By "word" in this
context we mean an individual word or group of words
which is spoken as a single utterance and for which
a single utterance acoustic model has been trained.

The preferred embodiment of the invention described
with regard to FIG. 1 is designed to operate as a
terminate-and-stay resident keyboard emulator. Such
keyboard emulators are well known in the personal
computer arts as programs which are loaded into
memory and then exited, so that they can be called
by an application program with which they are used
every time that application program requests input
from a keyboard. When such an application calls a
subroutine to get input from the keyboard, the
emulator takes over the call and performs the
emulator's program rather than the operating system's
standard keyboard processing routine. Then when the
emulator is done, it returns to the application
program, and if it produces any output, it produces
that output as a string of characters in the same
form as if those characters had been typed on a
keyboard.

The program of FIG. 1 starts at 100, shown in
the lower left hand side of that figure, by performing
step 101. Step 101 initializes the system. It
loads the remainder of the dictation program of
FIG. 1 into memory, and it initializes that dictation
program. This initialization includes setting the
recognition mode to TEXTMODE, which enables the
first word dictated by the user to be recognized as
a word to be entered into text; clearing SAV_TOK, a
buffer where a previously recognized word may be
stored; and clearing the string STARTSTRING, which
is used for correcting mistakenly recognized words.
Then step 101 exits the dictation program, leaving
it resident in memory so that it can be called by
the application program. Then it loads an
application program and enters it at step 102. In
the examples shown in this specification, the
application program is a word processing program.
But it should be clear that application programs
besides word processors can be used with the
dictation program of FIG. 1.

On the first entry into the application program
in step 102, that program initializes and, once it is
done initializing, it requests user input from the
keyboard. This request causes the application
program to, in effect, call the dictation program of
FIG. 1, starting with input detection loop 103 shown
at the top of FIG. 1. Loop 103 repeatedly performs
step 111, which tests for an utterance detection, and
step 104, which tests for a keystroke detection. The
program continues in loop 103 until either step 104
or step 111 detects some user input. If a keystroke
is generated, it is detected by step 104, and the
test of step 801 is applied.

The test in step 801 is satisfied if the system is
currently displaying a window (either a choice
window or a definition window) to the user. That is,
the test will be satisfied if either a choice window
has been displayed by either step 176 or step 274 as
explained below, or a definition window has been
displayed by step 270 as explained below, and the
display has not since been cleared by step 204 or
step 192. If the test of step 801 is not satisfied,
then the keystroke is sent directly to the
application. In this mode the user can type normally,
as if no speech recognition were being used. If the
test in step 801 is satisfied, it causes step 105 to
get the keystroke, and to supply it for use in the
sequence of branching tests represented by steps
106-110. Step 105 includes substeps which cause the
program to wait briefly, if the keystroke received
is a function key "f1" through "f9", for a specified
double-click period, such as a third of a second, to
see if the user presses the same function key again.
If he does, the keystroke is interpreted as a double
click associated with the same number as the twice
pressed function key, and a character representing
both the number of the function key pressed and the
fact that it was pressed twice is sent to the tests
of steps 106-110.
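
The double-click handling of step 105 can be illustrated with a minimal sketch. It assumes keystrokes arrive as time-stamped events; the function name, the event format, and the returned (key, is_double) pairs are hypothetical and are not taken from the specification.

```python
# Function keys "f1" through "f9", as named in the text.
FUNCTION_KEYS = {f"f{i}" for i in range(1, 10)}


def interpret_keystrokes(events, double_click_s=1 / 3):
    """Collapse repeated function-key presses into double clicks.

    events is a list of (time_in_seconds, key) pairs sorted by time
    (a hypothetical representation). Returns a list of
    (key, is_double) pairs.
    """
    out = []
    i = 0
    while i < len(events):
        t, key = events[i]
        is_double = (
            key in FUNCTION_KEYS
            and i + 1 < len(events)
            and events[i + 1][1] == key
            and events[i + 1][0] - t <= double_click_s
        )
        out.append((key, is_double))
        # A double click consumes both presses of the key.
        i += 2 if is_double else 1
    return out
```

A real keyboard emulator would wait on a timer rather than look ahead in a list; the list form is used here only so the decision rule is easy to follow.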

If, on the other hand, the user speaks an
utterance, it is detected by step 111, and a token is
made of the utterance and speech recognition is
performed upon it.

Referring now to FIG. 2, the steps which produce
a token in response to each utterance spoken
by the user will be described. In the preferred
embodiment, most of the steps of FIG. 2 are performed
by hardware other than the CPU which
performs the steps of FIG. 1, so that the steps of
FIG. 2 can be performed in parallel with those of
FIG. 1.

When the user speaks an utterance into the
microphone 811, it causes step 112 to receive an
acoustic signal from that microphone. Actually,
step 112 is constantly receiving an acoustic signal
from the microphone, as long as that microphone is
on, which it normally is when the user is using the
system for dictation. As a result, the steps of FIG. 2
should be viewed as the steps of a pipelined
process which are performed in parallel. The analog
electrical signal from microphone 811 received
by step 112 is converted to a sequence of digital
samples by the analog-to-digital conversion step
113. A fast Fourier transform step 114 is used to
compute a magnitude spectrum for each frame of
speech. Methods for computing a magnitude spectrum
from a sequence of digital samples are well known
to those skilled in the arts of digital signal
processing. In a preferred embodiment of the
present invention, the A/D conversion 113 samples
at 12,000 Hertz, creating 240 digital samples during
a 20 msec speech frame. The 240 digital samples
are multiplied by an appropriate window function,
such as a Hamming window, and then the FFT 114
is performed to compute a frequency spectrum.
Other methods of digital signal processing, such as
linear predictive coding (LPC), and other acoustic
parameterizations, such as cepstral coefficients, may
be used just as well.
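
The framing, windowing and spectrum computation of steps 113 and 114 can be sketched as follows. This is only an illustration under stated assumptions: frames are taken as consecutive and non-overlapping (the text does not say whether they overlap), and a plain discrete Fourier transform stands in for the FFT, since both produce the same values. All function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE_HZ = 12_000  # sampling rate given in the text
FRAME_MSEC = 20          # frame length given in the text
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MSEC // 1000  # 240 samples


def hamming(n):
    """Return an n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(samples, frame_len=FRAME_LEN):
    """Split samples into consecutive, non-overlapping frames (assumption)."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def magnitude_spectrum(frame):
    """Window one frame and return |DFT| for bins 0..N/2.

    A direct O(N^2) DFT is used for clarity; an FFT computes the
    same values more quickly.
    """
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(windowed)
        )
        spectrum.append(abs(acc))
    return spectrum
```

With these figures each bin is 12,000 / 240 = 50 Hz wide, so a 1,000 Hz tone falls in bin 20.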

After the magnitude spectrum has been computed
by the FFT 114, a test is performed in step
115 to determine if the sound currently being
received by the microphone and transformed by the
FFT appears to be the start of an utterance.
Although there are many ways of detecting utterances
known in the art of speech recognition, in
the preferred embodiment of the invention, step
115 considers an utterance to have started when
the FFT output indicates that the audio signal has
been above a relatively loud volume for 3 frames. If
the user places his mouth relatively close to the
microphone 811, his utterances will be substantially
louder than most other sounds detected by the
microphone, except relatively brief sounds such as
bangs, or inadvertent brief sounds made by the
user. As a result, this simple method of utterance
detection works fairly well.

If step 115 detects an utterance, step 116 
generates an utterance detection flag, which re- 
mains set until step 111 of FIG. 1 next tests that 
flag. If an utterance is detected, step 118 forms a 
token of it. In the preferred embodiment this token 
consists of a sequence of all frames produced by 
the FFT 114, starting sixteen frames before the 
frame at which the utterance detection was gen- 
erated, and continuing through to the frame at 
which the FFT indicates the amplitude of the signal 
has been below a threshold level associated with 
background noise for approximately a quarter sec- 
ond. This sequence of frames is the token repre- 
senting the utterance. 
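
The detection and token-forming rules of steps 115 and 118 can be sketched as below. The sketch assumes that a single loudness value per frame has already been derived from the FFT output, and that "approximately a quarter second" corresponds to 12 frames of 20 msec; both are assumptions, as are the function and parameter names.

```python
def extract_token(frames, loudness, loud_thresh, noise_thresh,
                  start_run=3, pre_roll=16, end_run=12):
    """Return the frames forming one token, or None if no utterance.

    frames and loudness are parallel lists; loudness[i] is a
    hypothetical per-frame loudness measure.
    """
    # Detection: loudness above loud_thresh for start_run frames in a row.
    run = 0
    detect = None
    for i, level in enumerate(loudness):
        run = run + 1 if level > loud_thresh else 0
        if run >= start_run:
            detect = i
            break
    if detect is None:
        return None

    # The token starts pre_roll frames before the detection frame.
    start = max(0, detect - pre_roll)

    # The token ends once the signal has stayed below the background
    # noise threshold for end_run consecutive frames.
    quiet = 0
    end = len(frames) - 1
    for i in range(detect + 1, len(loudness)):
        quiet = quiet + 1 if loudness[i] < noise_thresh else 0
        if quiet >= end_run:
            end = i
            break
    return frames[start:end + 1]
```

Starting the token before the detection frame keeps any quiet onset of the word that preceded the three loud frames.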

It should be understood that vector quantization 
step 117 may optionally be used in the formation 
of tokens. That is, some successful speech rec- 
ognition systems use vector quantization to com- 
press the data required to represent the information 
produced by the FFT for each frame. Other sys- 
tems do not. The present invention can work well 
with either form of acoustic analysis. As is well- 
known to those skilled in the arts of digital signal 
processing, the vector quantization step 117 con- 
sists of matching the spectral magnitude vector for 
the current speech frame against a code book of 
sample spectra. The vector of spectral magnitude 
values is then replaced by a single number, namely
the index in the code book of the best matching
sample spectrum. In step 118 a token is formed 
either by forming an array from the sequence of 
spectral magnitude vectors (if vector quantization 
was not used in step 117) or by forming a se- 
quence of vector quantization indices for the se- 
quence of speech frames. 
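
A minimal sketch of the optional vector quantization step 117 and the two forms of token built in step 118 follows. The text does not say how the best matching sample spectrum is measured, so squared Euclidean distance is assumed here; the function names are hypothetical.

```python
def quantize(vector, codebook):
    """Return the index of the code book entry nearest to vector.

    Squared Euclidean distance is an assumption of this sketch.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def make_token(spectra, codebook=None):
    """Form a token from a sequence of spectral magnitude vectors.

    Without a code book the token is the array of vectors itself;
    with one, it is the sequence of code book indices.
    """
    if codebook is None:
        return [list(v) for v in spectra]
    return [quantize(v, codebook) for v in spectra]
```

The quantized form stores one small integer per frame instead of a full spectrum, which is the data compression the paragraph above refers to.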

Returning now to FIG. 1, when step 111 detects
an utterance, it causes the program to advance
to step 119, which stores the token produced
by step 118 of FIG. 2 in a memory buffer
called TEMP_TOK, which is short for "temporary
token". The tests of steps 120 and 121 determine if
the recognition mode has been set to EDITMODE
or TEXTMODE. If the recognition mode has been
set to EDITMODE, step 120 causes step 122 to
perform EDITMODE speech recognition on the token
stored in TEMP_TOK. If, on the other hand,
the recognition mode has been set to TEXTMODE,
step 121 causes step 123 to perform TEXTMODE
recognition upon TEMP_TOK. TEXTMODE recognition
is the normal recognition mode which enables
the user to dictate words for inclusion in the
textual output of the system. It uses a vocabulary
which contains all the text words 124 for which the
system has acoustic models (shown in FIG. 3). The
current embodiment of the invention can have a
recognition vocabulary of over five thousand text
words once the acoustic models for these words
have been trained. In addition, the TEXTMODE
vocabulary contains menu selection commands
125, such as "pick_one", "pick_two", etc., edit
menu choice commands 126, such as "edit_one",
"edit_two", etc., and letter commands 127, such as
"starts_alpha", "starts_bravo", etc. (shown in
FIG. 3), which are used to edit mistakenly recognized
words, as is explained in greater detail below.
EDITMODE recognition uses a much smaller
active vocabulary including the menu selection
commands 125, edit menu choice commands 126,
and a simplified version of the letter commands
127, such as "alpha", "bravo", etc., but not including
the large vocabulary of text words.

Except for their different active vocabularies,
TEXTMODE and EDITMODE use the same recognition
algorithm 129, which is shown in a simplified
graphical form in FIG. 3. As FIG. 3 indicates,
this algorithm compares the sequence of individual
frames 130 which form the TEMP_TOK 131 with
each of a plurality of acoustic word models 132.
The preferred embodiment uses dynamic programming
that seeks to time align the sequence of individual
nodes 133 of each word model 132 against
the sequence of frames 130 which form
TEMP_TOK in a manner which maximizes the
similarity between each node of the word model
and the frames against which it is time aligned. A
score is computed for each such time aligned
match, based on the sum of the dissimilarity between
the acoustic information in each frame and
the acoustic model of the node against which it is
time aligned. The words with the lowest sums of such
distances are then selected as the best scoring
words. If language model filtering is used, a partial
score reflecting the probability of each word occurring
in the present language context is added to
the score of that word before selecting the best
scoring word, so that words which the language
model indicates are most probable in the current
context are more likely to be selected.

Referring now to FIG. 4, a much more detailed
description of the speech recognition algorithm 129
used in the preferred embodiment is illustrated.
This algorithm uses the method of hidden Markov
models. As is well known to those skilled in the arts
of automatic speech recognition, the hidden Markov
model method evaluates each word in the
active vocabulary by representing the acoustic
model 132 (shown in FIG. 3) for each word as a
hidden Markov process and by computing the



EP 0 376 501 A2 



probability of each word generating the current
acoustic token 131 (shown in FIG. 3) as a
probabilistic function of that hidden Markov process. In
the preferred embodiment of this invention, the
word scores are represented as the negative
logarithms of probabilities, so all scores are
non-negative, and a score of zero represents a
probability of one, that is, a perfect score.
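
The arithmetic behind negative-logarithm scores can be shown with a short worked example. The helper name neg_log is hypothetical, and the use of infinity for a probability of zero is a choice made for this sketch only.

```python
import math


def neg_log(p):
    """Return the negative natural logarithm of probability p."""
    return math.inf if p == 0 else -math.log(p)


# Multiplying probabilities corresponds to adding scores.
score_a = neg_log(0.5)
score_b = neg_log(0.25)
score_ab = score_a + score_b  # same as neg_log(0.5 * 0.25)
```

Because every probability is at most one, every such score is at least zero, and a lower score means a more probable event.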

The first step 151 of the matching routine of 
FIG. 4 is to initialize the scores to represent the
state of the Markov models at the beginning of the
token. That is, for each word in the vocabulary,
initialization step 152 is performed. In step 152, the
score of the first node 133 (shown in FIG. 3) of
each word model is set to zero (indicating that the
probability is one that the Markov process is in the
first node). The score of every other node 133 is
set to BAD_SCORE, a large positive number
(essentially corresponding to a probability of zero) 
such that any node with a score of BAD_SCORE 
is thresholded as being an impossible event. 

The main matching process is then performed
by step 153, which executes step 155 for each
frame 130 (shown in FIG. 3) of the token. Step 155,
in turn, performs the two steps 156 and 162 for
each word in the active vocabulary. 

For each node in a given word, step 156 performs
the two steps 157 and 158. In step 157 a
score, pass[NODE], is computed for the event
which corresponds to the Markov process having
already been in the current NODE during the
previous frame and remaining in the current NODE for
the current FRAME. That is, pass[NODE] is the sum
of score[WORD,NODE] and trans_prob[NODE,NODE],
where score[WORD,NODE] is the score of the current
NODE up through the previous frame, and
trans_prob[NODE,NODE] is (the negative logarithm
of) the probability of a Markov transition from
NODE to itself.

Step 158 then computes a score for each path
to the current NODE from any legal predecessor
node, PNODE. A predecessor node is a node from
which there is a non-zero probability of a transition
to the current NODE. For each predecessor node,
PNODE, step 158 performs steps 159 and 160.
Step 159 computes the score of the event consisting
of the Markov process having been in node
PNODE for the previous frame and of transitioning
to node NODE for the current frame. Step 160 then
compares this new_score with pass[NODE], which
was first computed in step 157. If new_score is
better than pass[NODE], then step 161 sets
pass[NODE] to new_score. That is, as each path to the
current NODE is considered, pass[NODE] is always
updated to represent the best score for any path
leading to the current NODE for the current
FRAME.

After step 158 is completed for each predecessor
PNODE, and step 156 is completed for each
NODE in the current WORD, step 162 is iterated
for each NODE in the current WORD. Step 162 is
executed as a separate loop because it overwrites
the array score[WORD,NODE], so it must wait until
step 156 has finished using the array
score[WORD,NODE] before it begins to overwrite it. For
each NODE of the current WORD, step 162 executes
step 163. Step 163 takes the score
pass[NODE] and adds the score for the current FRAME
for which step 153 is being iterated.

The function label(WORD,NODE) used in step
163 allows a variety of styles of word models to be
used within the hidden Markov model methodology
described in FIG. 3. As is well known to those
skilled in the art of automatic speech recognition,
there are two main classes of word models used in
hidden Markov modeling and in other automatic
speech recognition methods based on dynamic
programming: whole word models and phoneme-based
models of the type which are described in
greater detail below with regard to FIG. 8. It is also
possible to use more elaborate models such as
phoneme-in-context, which can be a combination of
simple phoneme models and whole word models.
Step 163 can accommodate any of these methods
by the appropriate choice of the function
label(WORD,NODE). For whole word models,
label(WORD,NODE) returns a unique value for each
NODE of each WORD. For phoneme-based
models, label(WORD,NODE) returns the same value
whenever the same phoneme is shared by two
or more words. For phoneme-in-context models,
label(WORD,NODE) returns a value uniquely determined
by the current phoneme and its context.
There is also a separation of speech recognition
systems based on whether a vector quantizer
is used, as described in the discussion of FIG. 2.
Systems which use vector quantization are called
discrete-alphabet systems, while systems that do
not use vector quantization are called
continuous-parameter systems. The function
obs_prob(label(WORD,NODE),FRAME) in step 163 allows the
preferred embodiment of the present invention to be
used as either a discrete-alphabet system or as a
continuous-parameter system. If a discrete alphabet
is used, then the function
obs_prob(label(WORD,NODE),FRAME) becomes simply a table
lookup in an array, where the first index is the
value of label(WORD,NODE) and the second index
is the vector quantizer label for the current FRAME.
In a continuous-parameter system, each value of
label(WORD,NODE) is associated with a parametric
probability distribution for the vector of acoustic
parameters. For example, the vector of acoustic
parameters may be modeled as a multivariate
Gaussian distribution. The function
obs_prob(label(WORD,NODE),FRAME) then computes the
negative logarithm of the probability of the vector
of parameters associated with the current FRAME
for the parametric probability distribution which is
associated with the current value of
label(WORD,NODE).
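
The two forms of obs_prob can be sketched side by side. The discrete form is the table lookup described above. For the continuous form, a Gaussian with a diagonal covariance matrix is assumed purely to keep the example short; the text only says "Gaussian". The function names and argument layouts are hypothetical.

```python
import math


def obs_prob_discrete(table, label, vq_index):
    """Discrete-alphabet case: a table lookup.

    table[label][vq_index] holds a negative log probability.
    """
    return table[label][vq_index]


def obs_prob_gaussian(means, variances, frame):
    """Continuous-parameter case: negative log density of frame.

    Assumes independent dimensions (diagonal covariance).
    """
    score = 0.0
    for x, mu, var in zip(frame, means, variances):
        score += 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
    return score
```

Note that a probability density can exceed one, so the continuous score may come out negative unless it is normalized or offset in some way.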

In any case, step 163 computes the negative
logarithm of the probability of the current FRAME
for the current NODE. This observation score is
added to pass[NODE] and the sum is stored in
score[WORD,NODE], representing the score of the
best path leading to the current NODE for the 
current FRAME. 
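
Steps 151 through 163 can be gathered into one short routine. The sketch below follows the step numbering of FIG. 4 as described in the text, but the data layout (a dictionary per word holding node labels and a transition table) and the value chosen for BAD_SCORE are assumptions made for illustration.

```python
import math

BAD_SCORE = 1.0e9  # stand-in for "a large positive number"


def match(token, models, obs_prob):
    """Score every word model against a token (steps 151-163).

    models maps each word to a dict with:
      "labels": the value of label(WORD, NODE) for each node, in order
      "trans":  {(pnode, node): negative log transition probability}
    obs_prob(label, frame) returns a negative log probability.
    Returns {word: [score of each node after the last frame]}.
    """
    # Steps 151-152: first node scores 0, every other node BAD_SCORE.
    score = {
        word: [0.0] + [BAD_SCORE] * (len(m["labels"]) - 1)
        for word, m in models.items()
    }
    for frame in token:                          # step 153
        for word, m in models.items():           # step 155
            n_nodes = len(m["labels"])
            passed = [BAD_SCORE] * n_nodes
            for node in range(n_nodes):          # step 156
                # Step 157: remain in the same node.
                best = score[word][node] + m["trans"].get((node, node), BAD_SCORE)
                # Steps 158-161: try each legal predecessor node.
                for (pnode, target), cost in m["trans"].items():
                    if target == node and pnode != node:
                        new_score = score[word][pnode] + cost
                        if new_score < best:
                            best = new_score
                passed[node] = best
            for node in range(n_nodes):          # step 162
                # Step 163: add the observation score, clamped at BAD_SCORE.
                total = passed[node] + obs_prob(m["labels"][node], frame)
                score[word][node] = min(total, BAD_SCORE)
    return score


# Toy usage: one two-node word; frames are vector quantizer labels.
HALF = -math.log(0.5)
toy_models = {
    "yes": {"labels": ["a", "b"],
            "trans": {(0, 0): HALF, (0, 1): HALF, (1, 1): 0.0}},
}


def toy_obs(label, frame):
    """Hypothetical observation score: 0 on a match, 5 otherwise."""
    return 0.0 if label == frame else 5.0


toy_scores = match(["a", "a", "b"], toy_models, toy_obs)
```

The separate `passed` list plays the role of pass[NODE]: scores from the previous frame are read in the first inner loop and only overwritten in the second.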

In the preferred embodiment, each word model 
has a special final node, last_node(WORD) (133f
in FIG. 3), corresponding to the silence (or
background noise) at the end of the discrete utterance.
These extra nodes allow for the possibility that,
near the end of the observed utterance, for some
words the best (most probable) interpretation may 
be that the word has already been completed, 
while for other words the best interpretation may 
be that the Markov process is still in a state cor- 
responding to a low energy speech sound, such as 
a stop or fricative. Having an extra node in each 
word model allows the ending time decision to be 
made separately for each word. 

Step 162 completes the steps which are done
for each word of the vocabulary to make up step
155. Step 155 completes the steps which are done
for every FRAME in the token to make up step
153. After step 153 has been iterated for every
FRAME of the token, step 166 outputs the "choice
list", which consists of the best scoring word and
the 8 next best scoring alternate word choices.
After step 153 has been done for every FRAME of
the utterance, score[WORD,last_node(WORD)] for
a given word will be the score for the hypothesis
that the word has been completed and that the
Markov process is in the special final node
corresponding to the silence after the word. Step 166
adds the language model score for each WORD to
the final score of the last node of the WORD. That
is, step 166 chooses the words to be put on the
choice list by comparing the values of the expression
(score[WORD,last_node(WORD)] - log(Ps(WORD | CONTEXT))),
where Ps is the language model probability as
computed in FIG. 9, for each of the words in the
active vocabulary, and, by techniques well known to
those skilled in the arts of digital computing,
selecting the words with the nine best scores. In
the preferred embodiment a threshold is set such
that any word with a score corresponding to a
probability less than 2~'^ will not be displayed
(normalizing the word probabilities to sum to one).
Usually fewer than 9 words will satisfy this
criterion, so fewer than 9 words will be displayed.
This step completes the preferred embodiment of
speech recognition algorithm 129 for discrete
utterance recognition.
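
The selection performed by step 166 can be sketched as follows. The display threshold is passed in as a hypothetical parameter, min_prob, and no value is suggested for it. Normalizing over all words of the active vocabulary is likewise an assumption, as are the function name and the dictionary inputs.

```python
import math


def choice_list(final_scores, lm_prob, max_choices=9, min_prob=None):
    """Return up to max_choices words, best score first.

    final_scores: {word: score[WORD, last_node(WORD)]} (negative logs)
    lm_prob:      {word: Ps(WORD | CONTEXT)}
    min_prob:     optional display threshold on the normalized word
                  probability (hypothetical parameter).
    """
    # Combined score: acoustic score minus log of the language model
    # probability. Words with zero language model probability are skipped.
    totals = {
        w: s - math.log(lm_prob[w])
        for w, s in final_scores.items()
        if lm_prob.get(w, 0) > 0
    }
    ranked = sorted(totals, key=totals.get)[:max_choices]
    if min_prob is None:
        return ranked

    # Normalize the word probabilities to sum to one, then threshold.
    probs = {w: math.exp(-t) for w, t in totals.items()}
    z = sum(probs.values())
    if z == 0:
        return ranked
    return [w for w in ranked if probs[w] / z >= min_prob]
```

Since the scores are negative logarithms, sorting in ascending order puts the most probable word first.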

Although the match routine 129 has been described
with regard to a particular embodiment,
many variations on the basic routine described in
FIG. 4 are well known to those skilled in the art of
automatic speech recognition. This invention is
compatible with many such variations. In particular,
this invention is compatible with variations which
may be used to improve the efficiency of large
vocabulary speech recognition. For example, to
reduce the amount of computation, various
thresholding schemes may be used so that the
match routine 129 need only evaluate the best
scoring nodes for each frame. The present invention
would work perfectly well with such thresholding
schemes, because the same hidden Markov
models would be used; merely fewer nodes would
be evaluated.

Another variation on the embodiment of FIG. 4
would be to normalize the scores so that the
computation could be done in 16-bit integer arithmetic.
Still another variation of the embodiment of FIG. 4
replaces step 160 by a combine-score routine that
sets pass[NODE] to the logarithm of the sum of
new_score and pass[NODE]. Again, the present
invention would work with such variations.
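
One way to read the combine-score variation is that the probabilities of the two paths are added and the result is converted back to a negative logarithm. The sketch below follows that interpretation, which is an assumption; the function name is hypothetical.

```python
import math


def combine_score(a, b):
    """Combine two negative-log scores by summing their probabilities.

    Computes -log(exp(-a) + exp(-b)) in a numerically stable way.
    """
    lo, hi = (a, b) if a <= b else (b, a)
    # exp(lo - hi) is at most 1, so this neither overflows nor loses
    # the smaller term when the two scores are close.
    return lo - math.log1p(math.exp(lo - hi))
```

The combined score is never worse than the better of the two inputs, whereas the comparison in step 160 keeps only the better one.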

Other variations of the basic match routine of
FIG. 4 which improve the efficiency for large
vocabulary speech recognition include a rapid-match
or prefilter stage which quickly eliminates all but
the most likely word candidates, or a lexical
retrieval stage which hypothesizes word candidates
bottom-up from acoustic evidence. Among such
variations, the present invention is compatible with
those that allow their associated acoustic models to
be built incrementally as individual words are
added to the vocabulary.

It will be seen, in the discussion below of FIG.
5A, that the present invention is compatible with an
alternate embodiment of the recognition algorithm
129 that is designed for connected and continuous
speech, as will be explained in greater detail with
regard to the embodiment of the invention
disclosed in FIG. 25.

Returning now to FIG. 1, once the speech
recognition routine called in either the EDITMODE
or TEXTMODE recognition of step 122 or step 123,
respectively, has selected a best scoring word and
a list of the next best scoring words, the
corresponding machine response, which in the
embodiment of FIG. 1 is a string of one or more
characters, associated with that best scoring word
is supplied to the branching tests represented by
the steps 106-110, just as is any keystroke input
supplied by the user (if a choice window is
displayed so the test in step 801 is satisfied).

If the user's input is an utterance recognized as
a text word, its associated machine response, or
output, is a string of printable ASCII characters
representing the spelling of its associated word
preceded by a special character which serves no
purpose except to distinguish single letter text
words from letter commands. When this machine
response is supplied to the sequence of tests
comprised of steps 106-110, the test of step 108 will be
satisfied and the program will advance to the
branch which starts with step 170. The test of step
108 will also be passed if step 123 fails to find any
best choice in response to an utterance. It is
assumed that any untrained words spoken by the
user when using the program of FIG. 1 are text
words, since it is assumed that before the user
attempts to speak any of the commands whose
outputs meet the tests of steps 106, 107, 109, or
110, he or she will have trained proper models for
those commands using traditional acoustic model
training techniques.

If the test of step 108 is passed, the program
advances to step 170. Step 170 sets the recognition
mode to TEXTMODE, since the system assumes
that after a user dictates one text word, his
next utterance may well be another text word. Then
step 172 saves the token stored in TEMP_TOK by
storing it in a buffer named REC_TOK, which is
short for "recognized utterance". This is done so
that if the user utters a spoken command to
re-recognize the token formerly stored in TEMP_TOK,
it will be available for such recognition and will not
be lost due to having been overwritten by any
such spoken command.

Next step 174 confirms the top choice, or best 
scoring word, from the recognition or rerecognition 
of the previous text word, if any, displayed at the 
time the token just saved in REC_TOK was uttered.
Then step 176 erases the prior display and
displays the choices from the recognition of the
token just saved, with the choices displayed in
order, with the top choice, or best scoring word,
first, and with each choice having next to it a
function key number, "f1" through "f9". This display
is made in the form of a pop-up active window
display 701 shown in FIGS. 10 - 22. Of course, 
when the program of FIG. 1 is first entered, there is 
no current active window display 701, because 
there has been no previous recognition, and thus 
no top choice to confirm. 

In the preferred embodiments, word choices
are displayed by means of pop-up windows, such
as active window 701, because the preferred
embodiments are designed to be used as a voice
keyboard in conjunction with application programs
which have their own screens. It should be understood,
however, that in other embodiments of the
present invention other methods of presenting the
recognition choices could be easily implemented
by those skilled in the arts of interactive computing
systems. For example, if the system was designed
for situations in which the operator is performing
some task (such as viewing through a microscope)
which prevents the operator from viewing a computer
display screen, the choice list can be presented
in spoken form using synthetic speech.
Such synthetic speech could also be used by a
visually impaired user.

Once step 176 has displayed the best scoring
words, the program advances to step 803 which
tests whether SAV_TOK is non-empty. SAV_TOK
will hold a previously recognized word which is
now being confirmed by test 108, if step 264 has
been executed since the last time that either step
194 or step 216 was executed. If SAV_TOK
is not empty, the program goes to step 208,
explained below. If SAV_TOK is empty, the program
advances to step 178, which stores the top word
choice confirmed in step 174, if any, in the language
context buffer. In the embodiment of FIG. 1,
this buffer stores the last two confirmed words. If
the system is to be used with word processors or
other applications that let a user move around a
body of text and insert words other than in a
sequential order, it is desirable that means be
provided to indicate to the system that the next
word to be dictated is not to be located after the
last word recognized by the language model. Upon
such an indication the language context buffer can
be cleared or have its contents set to the values of
the two words immediately preceding the location in
the body of text at which the next dictated word is
to be inserted. If the language context is cleared,
the language model will base its score only on the
context independent probability of words being
spoken.

After step 178 stores the confirmed word, if 
any, in the language context buffer, step 180 uses 
the confirmed word, if any, to update the language 
model used by the recognition system. Several 
such language models are well-known to those 
skilled in the art of natural language speech rec- 
ognition, and this invention would work with any of 
them. In the preferred embodiment, the language 
model is a probabilistic language model based on 
the counts of word bigrams, as shown in more 
detail in FIG. 9. That is, for each pair of words W1 
and W2, statistics are kept of how often the pair 
occurs as successive words in the text. During 
recognition, where W2 is the word to be recog- 
nized and W1 is the most recently confirmed word 
stored in the language context buffer, the probabil- 
ity of W2 is estimated as the number of counts for 
the pair W1, W2 divided by the total number of 
counts for W1, as shown at step 182. 

A context-free probability is also estimated, as 
shown at step 184. The context-free probability of 
W2 is simply the number of counts of W2 divided 
by the total number of counts. 

In step 186, the context-dependent probability 
estimate and the context-free probability estimate 
are combined into a smoothed context-dependent 
estimate, by adding the two probability estimates, 
giving a weight of (1-w) to the context-dependent 
estimate and a weight of w to the context-free 
estimate. This smoothing allows for words that 
have never occurred before in the context of a 
particular W1. A typical value for w in the preferred 
embodiment would be w = 0.1, but w may be 
empirically adjusted by the user to optimize perfor- 
mance based on the amount of training text avail- 
able for estimating the context-dependent model. If 
more text is available, a smaller value of w may be 
used. 
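
The probability estimates of steps 182, 184 and 186 can be illustrated with a minimal sketch. The class and method names below are hypothetical and do not come from the specification, and the "total number of counts for W1" is assumed here to be the number of times W1 has been seen as the first word of a pair.

```python
from collections import Counter


class BigramModel:
    """Illustrative count-based bigram model with linear smoothing."""

    def __init__(self, w=0.1):
        self.w = w                       # weight given to the context-free estimate
        self.pair_counts = Counter()     # counts of (W1, W2) pairs
        self.context_counts = Counter()  # counts of W1 as the first word of a pair
        self.word_counts = Counter()     # counts of each word on its own
        self.total = 0                   # total number of counted words

    def update(self, w1, w2):
        """Count confirmed word w2; w1 is the previously confirmed word or None."""
        self.word_counts[w2] += 1
        self.total += 1
        if w1 is not None:
            self.pair_counts[(w1, w2)] += 1
            self.context_counts[w1] += 1

    def context_dependent(self, w1, w2):
        """Pair count divided by the count for W1 (as in step 182)."""
        n = self.context_counts[w1]
        return self.pair_counts[(w1, w2)] / n if n else 0.0

    def context_free(self, w2):
        """Count of W2 divided by the total count (as in step 184)."""
        return self.word_counts[w2] / self.total if self.total else 0.0

    def smoothed(self, w1, w2):
        """Weighted sum of the two estimates (as in step 186)."""
        return ((1 - self.w) * self.context_dependent(w1, w2)
                + self.w * self.context_free(w2))
```

With w = 0.1, a word never yet seen after W1 still receives one tenth of its context-free probability instead of zero.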

When step 180 updates the language model, it 
uses the most recent confirmed word stored in the 
language context buffer to update the count of both 
W2 and the total count used in the probability 
calculation of steps 182, 184, and 186. It also uses 
the second most recently confirmed word as the 
value of W1 in the steps 182 and 186, and as the 
value of CONTEXT in step 166 of FIG. 4. 

After the language model has been updated in 
step 180, the program advances to step 188 which 
produces the confirmed word as an output to the 
application program that the dictation program of 
FIG. 1 is being used with. It produces this output 
as a string of ASCII characters, as if those char- 
acters had been typed on the keyboard of the 
system's computer. As those skilled in the art of 
terminate-and-stay-resident programs will under- 
stand, it accomplishes this in a manner which is 
too complicated to fit onto FIG. 1. Actually step 188 
returns only the first character of the output string 
to the application program and places the rest in 
an output buffer. Then each subsequent time the 
application calls the dictation program for input 
until the output string is empty, that program, in- 
stead of advancing to the polling loop 103 at the 
top of FIG. 1, merely returns with the next char- 
acter of the output buffer. For purposes of sim- 
plification, however, in the remainder of this speci- 
fication step 188 and its corresponding 188a in 
FIG. 25 are treated as if they output the entire 
string at one time. In the preferred embodiment, 
step 188 and the corresponding step 188a also 
apply punctuation rules that put spaces between 
words and capitalize the first word of a sentence. 
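
The character-at-a-time output described for step 188 can be pictured with the following sketch. The class OutputFeeder and its methods are hypothetical names chosen for illustration; the actual terminate-and-stay-resident mechanism is not given in the specification and is not modeled here.

```python
from collections import deque


class OutputFeeder:
    """Illustrative output buffer that hands back one character per call."""

    def __init__(self):
        self._pending = deque()  # characters not yet returned to the application

    def emit(self, text):
        """Queue a confirmed word and return its first character at once."""
        self._pending.extend(text)
        return self.next_char()

    def next_char(self):
        """Return the next buffered character, or None when the buffer is empty.

        A None result stands for the case in which the dictation program
        would advance to the polling loop instead of returning a character.
        """
        return self._pending.popleft() if self._pending else None
```

Each later request for input from the application would call next_char until it yields None.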

Once step 188 is performed, the dictation pro- 
gram returns to the application program to enable 
the application program to respond to the output 
string, if any, as is indicated by step 102. In the 
example shown with regard to FIGS. 10-23, the 
application program is a word processor which 
represents and can edit a body of text, and which 
has a cursor 700 which indicates where in the body 
of text the next input from the user will be inserted. 
Thus in step 102 the word processor inserts any 
output string supplied by step 188 into its text at 
the current cursor location as if that string had 
been typed in by keyboard. 

Once the application program has handled all 
the characters in the output string, it requests the 
next keystroke. This causes the program flow again 
to return to the polling loop 103 at the top of FIG. 
1. Loop 103 polls for the user's next input. If it is a 
keystroke, steps 104 and 105 fetch it. If it is an 
utterance, steps 111, 119 and either 102 and 122 
or 121 and 123 recognize it. Then the program 
again advances to the branching tests of steps 
106-110. 

If the user input consisted of a delete com- 
mand in the form of the pressing of the "delete" 
key or an utterance recognized as 
"delete_utterance", the test of step 106 will be 
satisfied, and the dictate program will advance to 
the program branch starting with step 190. The 
delete command is designed to let the user in- 
dicate that the last utterance recognized as a text 
word, for which an active window display 701 is 
shown, should be abandoned and erased from the 
system. This command is often used if the recog- 
nizer mistakenly treats an inadvertent sound made 
by the user or a background sound as an utter- 
ance. It also may be used if the user decides to 
replace the last word he dictated with another 
word. 

Step 190 sets the dictation program to TEXT- 
MODE, since it assumes that after a delete com- 
mand the next word the user enters will be a word 
to be entered into text. Then step 192 clears 
REC_TOK, to indicate that no further rerecognition 
is to take place upon the token formerly stored in 
that buffer. Step 192 also clears the STARTSTRING so 
that it will be empty the next time the user decides 
to employ it. It also clears the pop-up windows 
which the dictation program has placed on the 
computer's screen. Next step 194 clears a buffer 
named SAV_TOK, which is short for 
"saved_token", to indicate that the utterance in 
REC_TOK and SAV_TOK should not be used to 
make training utterances or to train word models in 
steps 208 and 214, described below. Then step 
196 aborts the current utterance for which a win- 
dow 701 was displayed. It does this by returning to 
loop 103 at the top of FIG. 1 for the user's next 
input. 

If, when the dictation program advances to the 
tests of steps 106-110, the input detected from the 
user is a valid pick-choice command, 
the test of step 107 is satisfied, and the branch of 
the program beneath that step is performed. The 
valid pick-choice commands consist of any of the 
function keys "f1" - "f9" or any of the voice com- 
mands "pick_one" through "pick_nine" which 
correspond to the number of a word choice dis- 
played in either the active window display 701 or 
the Dictionary Window 702 (See FIGS. 10-23). 
Such commands are used to indicate that a word 
having a given number "f1" through "f9" in either 
the window 701 or 702 is the desired word for the 
utterance associated with that display. 

If the user does enter a pick-choice command, 
step 107 branches the program to step 200. Step 
200 sets the recognition mode to TEXTMODE, 
since it is assumed that after the user confirms a 
given word choice he or she may want to enter 
another text word. Then step 202 causes the token 
currently in REC_TOK, which represents the utter- 
ance associated with the word choice the user has 
picked, to be saved in SAV_TOK. As is explained 
below, this will cause that token to be used in the 
training of an acoustic word model 132 (shown in 
FIG. 3) of the word the user has picked. Next step 
204 clears REC_TOK to indicate that no further 
rerecognition is to take place on the token formerly 
stored there. It also clears STARTSTRING and any 
pop-up windows on the screen. Then step 206 
confirms the word choice having the number asso- 
ciated with the pick-choice command the user has 
just entered. 

After this is done the program advances to 
steps 208, 214, and 216, which all relate to using 
the token associated with the utterance of the word 
just confirmed to train up acoustic models of that 
word. Step 208 causes the token 131 stored in 
SAV_TOK, if that buffer is not empty, to be stored 
in a tokenstore 209 (shown in FIG. 3) in association 
with the word 210 just confirmed by the pick- 
choice command, and then clears SAV_TOK. Step 
214 finds all the tokens previously stored in the 
tokenstore in association with the just confirmed 
word and builds a new acoustic model for that 
word with those tokens (as is indicated in the lower 
right hand corner of FIG. 3). Step 216 stores this 
acoustic word model with the other acoustic word 
models used by the system's recognizer (as is 
indicated by the arrow leading from the model 
builder to the acoustic model store in FIG. 3). 

The saving of tokens associated with con- 
firmed words and the use of multiple tokens saved 
in conjunction with the confirmed word performed 
by the program of FIG. 1 provide an effective and 
efficient means of adaptive speech recognition. 
This method only saves tokens for utterances when 
the user has specifically confirmed the word to be 
associated with that utterance by means of a pick- 
choice command. This lets the user greatly reduce 
the chance that an incorrectly recognized utterance 
will be used to train models of the wrong words. It 
also tends to greatly reduce unnecessary training, 
since normally the pick-choice command is only 
used to pick a word other than the top choice, and 
thus it reduces the likelihood that the system will 
replace word models that are producing correct 
recognition. It is true that a user, by pressing "f1", 
can use a pick-choice command to pick the top 
choice in an utterance's first recognition (that per- 
formed in step 123). But normally the user will not, 
since it is much easier to confirm such an initial top 
choice merely by speaking another text word, 
which will cause the branch of programming under 
step 108 to be performed. 

The use of multiple tokens to train word 
models in an adaptive environment is very useful. 
Traditional adaptive training techniques usually in- 
volve taking data from one new token of a word 
and using it to modify the parameters of a previous 
model of that word. This adding of data from a new 
token into an old model often causes the new data 
to be applied in a way which gives it less weight 
than if a model were trained from several tokens 
including that new token all at one time. 

The acoustic tokenstore used with the embodi- 
ment of FIG. 1 is a circular buffer of tokens which 
contains the 50 most recently saved tokens. In 
alternate embodiments, an even larger circular 
buffer could be used. In the preferred embodiment, 
if there is only one new token available for a 
particular word, then traditional adaptive training is 
used. That is, the data from the new token is 
averaged with the data of the old model. 
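
A circular tokenstore of the kind described, together with the lookup of step 214, might be sketched as follows. The class TokenStore is a hypothetical illustration; tokens are treated as opaque objects and the model building itself is not shown.

```python
from collections import deque


class TokenStore:
    """Illustrative circular buffer holding the most recently saved tokens."""

    def __init__(self, capacity=50):
        # A deque with maxlen silently discards the oldest entry when full.
        self._buf = deque(maxlen=capacity)

    def save(self, word, token):
        """Store a token in association with its confirmed word (step 208)."""
        self._buf.append((word, token))

    def tokens_for(self, word):
        """Return every stored token for a word, oldest first (used by step 214)."""
        return [token for stored_word, token in self._buf if stored_word == word]
```

A word model would then be rebuilt from tokens_for(word) whenever that list holds more than one token.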

As is well-known to those skilled in the art of 
hidden Markov model speech recognition, training 
or model building can be done using dynamic 
programming routines which are very similar to the 
recognition match routine 129 described above. A 
schematic block diagram of such a dynamic pro- 
gramming training routine is given in FIG. 5. The 
dynamic programming analysis is broken into two 
phases. Step 218 is a forward scoring routine simi- 
lar to the discrete utterance match routine shown in 
FIG. 4. This forward scoring pass incrementally 
computes the best paths through the Markov 
models, for all partial utterances. The second 
phase, step 220, traces back through the partial 
paths which were computed during the forward 
scoring to find the globally best path. When used 
in the training of discrete word models, this glo- 
bally best path is the path through the entire word 
chosen by the dynamic programming, and it asso- 
ciates successive groupings of one or more frames 
130 from each training utterance of the word with 
the successive nodes 133 of the word model 132 
being built, as in FIG. 8. 

The block diagram shown in FIG. 5 may be 
applied to either building acoustic models, as is 
performed in step 214 of FIG. 1, or, as will be 
explained in greater detail below, with the slight 
modification shown in FIG. 5A it can be applied as 
a match routine for connected or continuous 
speech. 

The forward scoring step 218 is shown in more 
detail in FIG. 6. First, in step 224, each node is 
initialized (step 226) to BAD_SCORE, which cor- 
responds to a probability of zero. The array 
OLD[NODE] holds the scores from the previous 
FRAME, while the array NEW[NODE] holds scores 
being computed for the current FRAME. In step 
228, the START_NODE is initialized to 
BEST_SCORE which, in the preferred embodi- 
ment, would be zero, representing a probability of 
one that the process starts in state START_NODE. 

The loop of step 230 is then iterated for each 
FRAME in the utterance. For each FRAME, the 
loop of step 232 is iterated for each ARC in the 
network. For each ARC, step 234 takes the OLD 
score for the left-hand node of the ARC and adds 
the Markov transition probability score for the par- 
ticular ARC. Since the scores are negative loga- 
rithms of probabilities, adding scores corresponds 
to multiplying probabilities. The PASS score which 
has been computed as this sum is compared with 
the NEW score for the right-hand node for the 
current ARC. If the PASS score is better, step 236 
uses it to replace the NEW score for the right-hand 
node, and stores traceback information in the array 
BACK. Thus for each NODE, NEW[NODE] will 
eventually have the score of the best path leading 
to NODE for the current frame, and 
BACK[NODE,FRAME] will identify the node which in the 
previous FRAME was the node on that best path. 

After step 232 is completed for each ARC in 
the network, the loop of step 238 is iterated for 
each NODE. For each NODE, step 240 adds the 
observation probability score. This observation 
probability score is the same as discussed in step 
163 of FIG. 4. Step 240 also re-initializes the array 
of NEW scores to BAD_SCORE. 
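
The forward scoring pass can be sketched as below, under several assumptions: scores are negative logarithms so that lower is better, the network is given as a list of (left node, right node, transition score) arcs, and the observation score is supplied as a function. The function name forward_score and its argument layout are hypothetical. Instead of re-initializing a single NEW array, the sketch allocates a fresh one for each frame, which is assumed to have the same effect.

```python
import math

BAD_SCORE = math.inf  # negative log of a probability of zero
BEST_SCORE = 0.0      # negative log of a probability of one


def forward_score(num_nodes, arcs, obs_score, num_frames, start_node=0):
    """Illustrative forward pass over a network of arcs.

    arcs is a list of (left, right, transition_score) tuples and
    obs_score(node, frame) returns the observation score. Returns the
    final score of each node and back[frame][node], the predecessor of
    node on its best path at that frame (None if the node is unreachable).
    """
    old = [BAD_SCORE] * num_nodes
    old[start_node] = BEST_SCORE
    back = []
    for frame in range(num_frames):
        new = [BAD_SCORE] * num_nodes
        prev = [None] * num_nodes
        for left, right, trans in arcs:
            passed = old[left] + trans  # adding scores multiplies probabilities
            if passed < new[right]:     # keep only the better (lower) score
                new[right] = passed
                prev[right] = left
        for node in range(num_nodes):
            new[node] += obs_score(node, frame)
        old = new
        back.append(prev)
    return old, back
```

In this sketch the start node stands for the state before the first frame, so a word model with two acoustic nodes uses three nodes in all.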

Step 220 of FIG. 5 is shown in more detail in 
FIG. 7. It traces back through the best path which 
leads to END_NODE at the end of the utterance, 
END_FRAME. Step 251 starts the trace back at 
NODE END_NODE and FRAME END_FRAME. 
Step 252 then traces back until getting to the start 
at FRAME = 0. For each frame tracing backwards, 
step 253 is executed. 

Step 253 consists of two parts. In step 254, 
statistics are recorded. In step 255 the variable 
NODE is set equal to BACK[NODE,FRAME], the 
node the time alignment of the forward pass 218 
associated with the frame preceding the current 
frame FRAME, and then the current FRAME is set 
equal to what had been the frame preceding the 
current frame. 
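
The traceback loop can be sketched as follows, assuming the traceback information is held as a list indexed first by frame and then by node. The function name trace_back is hypothetical, and the recording of statistics in step 254 is reduced here to noting which node is aligned with each frame.

```python
def trace_back(back, end_node):
    """Illustrative traceback returning the node aligned with each frame.

    back[frame][node] is the node that preceded node on the best path at
    that frame. The result lists one node per frame, first frame first.
    """
    path = []
    node = end_node
    for frame in range(len(back) - 1, -1, -1):
        path.append(node)         # stands in for recording statistics (step 254)
        node = back[frame][node]  # step to the preceding frame's node (step 255)
    path.reverse()
    return path
```

The resulting alignment is what allows the frames of a token to be grouped by node for re-estimation.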

When step 220 is being used in acoustic 
model building (step 214 of FIG. 1), the statistics to 
be recorded are acoustic parameters of the current 
FRAME (which will be a vector of parameters if 
vector quantization is not used in step 118 of FIG. 
2, but will be just an index into the code book if 
vector quantization is used), associated with the 
conditional probability distributions for the 
label(WORD,NODE) of the current NODE of the WORD 
being trained. 

In step 222, the models are re-estimated using 
a variation of the well-known Baum-Welch algo- 
rithm, or more generally using the EM algorithm, 
which is well-known to those skilled in the arts of 
statistics and, more particularly, in the art of hidden 
Markov modeling. The re-estimation essentially 
consists of a maximum likelihood estimation of the 
statistical parameters which characterize the prob- 
ability distributions. If the system has existing 
models for the word or phonemes for which a new 
model is being built and there is only one token for 
the word in the tokenstore, then the accumulated 
statistics from previous training are combined with 
the statistics from the token in the acoustic token 
store used in step 214 to make a combined maxi- 
mum likelihood estimate of the statistical param- 
eters. As is well-known to those skilled in the art, 
this operation will allow adaptive training of the 
acoustic models, as well as zero-based training if 
no previous model exists. 

Referring now to FIG. 8, a very simplified sche- 
matic representation of the training algorithm just 
described is shown. In it the acoustic descriptions 
131a and 131b of the spoken words "above" and 
"about", respectively, are used to train acoustic 
word models 132a and 132b of those words. For the 
initial pass of the model building, if no previous 
model existed for the word, an initial model could 
be used which is comprised of a desired number of 
nodes produced by dividing the frames 130 of 
each acoustic description, such as the token 131a, 
into a corresponding number of essentially equal 
sized groupings, and then calculating the model of 
each node from the frames of its corresponding 
grouping, as is shown in the formation of the up- 
permost word model 132a shown in FIG. 8. Then 
this first word model is time aligned against the 
acoustic description 131a by the forward scoring 
218 and the traceback 220 of FIG. 5. Then the 
frames 130 of this description which are time 
aligned against each of the initial nodes 133 are 
used to make a new estimate of that node, as is 
shown by the groupings in the second view of the 
token 131a in FIG. 8, and the arrows leading from 
each such grouping to its corresponding node 133 
in the second calculation of the word model 132a. 
This cycle of forward pass 218, traceback 220, and 
model re-estimation 222 for the word model 132a 
is shown being performed once more in FIG. 8. In 
the preferred embodiment, this model building cy- 
cle of steps 218, 220, and 222 would be executed 
iteratively for 8 passes. 
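
The construction of the initial model by dividing a token into essentially equal sized groupings can be sketched as below. As a simplifying assumption, each node is represented only by the mean of the parameter vectors in its grouping, whereas the node models of the specification are probability distributions. The function name initial_node_means is hypothetical.

```python
def initial_node_means(frames, num_nodes):
    """Illustrative initial model: one mean vector per node.

    frames is a list of equal-length parameter vectors for one token.
    The frames are divided into num_nodes nearly equal consecutive groups.
    """
    n = len(frames)
    if not 1 <= num_nodes <= n:
        raise ValueError("need between 1 and len(frames) nodes")
    means = []
    for k in range(num_nodes):
        lo = k * n // num_nodes        # first frame of this grouping
        hi = (k + 1) * n // num_nodes  # one past its last frame
        group = frames[lo:hi]
        dim = len(group[0])
        means.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return means
```

Later passes would replace these equal groupings with the groupings found by the forward scoring and traceback.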

FIG. 8 also is used to show how, in an alternate 
embodiment of the invention, a backup dictionary 
including the phonemic pronunciation of each word 
can be used. The dictionary 500 shown in FIG. 3, 
and used in the embodiment of the invention de- 
scribed with regard to FIG. 1, merely consists of an 
alphabetical listing of word spellings 504 which can 
be used as machine responses if such spellings 
are selected by the user and confirmed for output 
to the application program. In the alternate embodi- 
ment using a phonemic dictionary 500a, each of 
the word entries 504a in the phonemic dictionary 
has associated with it a spelling comprised of pho- 
nemic symbols 506. There are well known methods 
in the art of speech recognition for producing an 
acoustic model of a word for which a phonemic 
spelling has been provided by stringing together 
models for each of the phonemes in that word's 
phonemic spelling. Models for each such phoneme 
can be derived by building acoustic models of 
words having a known phonetic spelling, in the 
manner shown in the top half of FIG. 8, and then 
using the acoustic node models 133 associated 
with each phoneme in that model as a model of its 
associated phoneme. Preferably data from a plural- 
ity of nodes 133 associated with the speaking of a 
given phoneme 506 in several different words are 
combined, as is shown in FIG. 8, to produce the 
phoneme model 508 associated with that phoneme. 
As those skilled in the art are well aware, the 
making of such phoneme models requires that the 
phonemic spelling used be one which corresponds 
well with the individual nodes which tend to be 
formed in the training of the acoustic word models 
used to obtain data for the formation of those node 
models. As is also known in the speech recognition 
arts, phonemic models give improved performance 
if those models take into account the phonemes 
preceding and following a given phoneme in ques- 
tion. Thus in the example of FIG. 8, the schwa 
sound for which a phonetic model is shown being 
made is a schwa occurring at the start of a word 
before a "b". 

In such an embodiment, once a model is made 
by the method shown in FIG. 8 for each phoneme, 
an acoustic model can be made for each word in 
the phonetic dictionary 500a. In such an embodi- 
ment the recognition of step 123 can be performed 
against phonemic acoustic models which have 
been produced from the phonemic spellings in the 
acoustic dictionary without having the user individ- 
ually train their associated word, as well as against 
acoustic models of words which have been individ- 
ually trained. When recognition is being performed 
against phonemic as well as against individually 
trained acoustic models, it is desirable that a lan- 
guage model be used to limit the recognition vo- 
cabulary to a number of words which the recog- 
nizer can handle in a response time which the user 
finds satisfactory. This is done by limiting the rec- 
ognition vocabulary to such a number of words 
which the language model considers most likely 
to occur in the current language context. 

Returning now to FIG. 1, after steps 208, 214, 
and 216 have stored the token associated with the 
just confirmed word, and made and stored a new 
acoustic model for that word based at least in part 
on that token, the program advances to steps 178, 
180 and 188, described above, which store the 
confirmed word in the language context buffer, use 
the confirmed word to update the language model 
probabilities, and output the confirmed word to the 
application program. Then, in step 102, the pro- 
gram flow returns to the application program. The 
application program responds to the output of the 
ASCII string representing the confirmed word, and 
then causes the program flow to advance to the top 
of FIG. 1 to get the next input from the user. 

If, when the dictation program advances to the 
tests of steps 106-110, the input detected from the 
user corresponds to a letter command, the test of 
step 109 is satisfied and the program branch under 
that step is executed. The letter commands are 
commands which either instruct a letter to be ad- 
ded to a string called the STARTSTRING, or which 
specify an editing function to be performed upon 
the STARTSTRING. The STARTSTRING is a string 
which specifies the initial letters of the word to be 
recognized (i.e., the word represented by the token 
in REC_TOK). 

When the system is in TEXTMODE, the only 
letter commands allowed are either the pressing of 
any letter key or the speaking of "starts_alpha", 
"starts_beta", etc. which correspond to the press- 
ing of "a", "b", etc. These commands indicate the 
first letter of the STARTSTRING, and they cause 
the system to be switched to EDITMODE, in which 
the recognition performed at the top of FIG. 1 has 
its vocabulary limited to letter commands, the de- 
lete command discussed above with regard to step 
106, the pick-choice commands discussed above 
with regard to step 107, and the edit-choice com- 
mands discussed below with regard to step 110. In 
the EDITMODE, the letter commands include com- 
mands for adding another letter to the end of the 
STARTSTRING either by pressing a letter key or 
speaking a word from the communications al- 
phabet which is associated with such a letter (i.e., 
"alpha" for "a", "beta" for "b", "Charlie" for "c", 
etc.). In the EDITMODE, the letter commands also 
include a command for deleting the last letter ad- 
ded to the STARTSTRING either by pressing the 
"backspace" key, or by saying "backspace". Limit- 
ing the recognition vocabulary in EDITMODE im- 
proves the speed and accuracy of its recognition. 
Also, it lets the commands for adding individual 
letters to the STARTSTRING be changed from 
"starts_alpha" through "starts_zulu" to the more 
quickly said "alpha" through "zulu". This change is 
made possible by the fact that the EDITMODE's 
restricted vocabulary does not contain text words 
which are the same as, or confusingly similar to, 
the words of the communications alphabet. 

Once step 109 detects that the user has en- 
tered a letter command, it advances to step 260, 
which sets the system's recognition mode to the 
EDITMODE so that the user can further add or 
subtract letters from the edit string, as described 
above. Then in step 262 the program adds the 
specific letter indicated by the letter command to 
the current STARTSTRING, or performs the spe- 
cific editing command upon the STARTSTRING 
specified by that letter command. When test 109 is 
first passed for a given token in REC_TOK, the 
STARTSTRING will have been previously cleared 
either by step 101, step 192, or step 204. When 
that test is passed for the second or subsequent 
time for a given token in REC_TOK, the START- 
STRING will normally already have one or more 
letters in it, and the new letter is added to the end 
of that string. 
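
The effect of step 262 on the STARTSTRING can be sketched as follows. The function apply_letter_command is a hypothetical illustration. It assumes that every word of the communications alphabet begins with the letter it stands for, and that a command arrives as a plain string such as a key name or a recognized command word.

```python
def apply_letter_command(startstring, command):
    """Illustrative update of the STARTSTRING by one letter command.

    command may be "backspace", a single letter key such as "t", a
    communications-alphabet word such as "alpha", or a prefixed form
    such as "starts_alpha".
    """
    if command == "backspace":
        return startstring[:-1]  # delete the last letter, if any
    word = command[len("starts_"):] if command.startswith("starts_") else command
    if not word or not word[0].isalpha():
        raise ValueError("not a letter command: " + repr(command))
    return startstring + word[0].lower()  # append the letter it stands for
```

The backspace case is tested first so that the word "backspace" is never mistaken for the letter "b".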

After the step 262 has been performed, the 
program advances to step 264, which saves the 
token stored in REC_TOK in SAV_TOK. This is 
done so that if the user later confirms that the 
spoken word represented by the token in 
REC_TOK corresponds to a given word choice, 
that token will be used by steps 208 and 214, 
described above, to make a token and a model of 
that confirmed word. 

Then step 266 restricts the active vocabulary to 
be used by the rerecognition step 272, described 
below, to words which start with the current START- 
STRING. Techniques are well-known by those 
skilled in the computer arts for matching the spell- 
ing of each word in a vocabulary list against a 
string of initial characters and selecting only the 
words that start with the indicated initial string. 

Then step 268 retrieves a list of vocabulary 
words from a backup dictionary which also start 
with the specified initial string, but which are not in 
the active vocabulary of words having acoustic 
models selected by step 266. In the embodiment 
of FIG. 1, the backup dictionary 500 (shown in FIG. 
3) is a large dictionary with 80,000 or more entries. 
Step 268 uses probabilities from the language 
model to select, from the words in the backup 
dictionary which start with the current START- 
STRING, the up to nine words which are the most likely 
to be used in the current language context. 
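
The selection of backup words in step 268 can be sketched as below. The function select_backup_words and its parameters are hypothetical, the language model is represented by an assumed function returning a probability for each word in the current context, and ties are broken alphabetically as an arbitrary choice.

```python
def select_backup_words(backup_dictionary, active_vocabulary, startstring,
                        probability, limit=9):
    """Illustrative choice of the most likely backup words for a prefix.

    Keeps words that start with startstring and are not already in the
    active vocabulary, then returns up to limit of them ordered by
    decreasing language model probability.
    """
    prefix = startstring.lower()
    active = set(active_vocabulary)
    candidates = [w for w in backup_dictionary
                  if w.lower().startswith(prefix) and w not in active]
    candidates.sort(key=lambda w: (-probability(w), w))
    return candidates[:limit]
```

The same prefix test, applied to the active vocabulary alone, would give the restricted vocabulary of step 266.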

Next step 270 displays the current START- 
STRING in a Definition Window which shows the 
operator the initial string which has been specified 
so far. The Definition Window is shown as box 703 
in the example illustrated in FIGS. 10-24. It is called 
a Definition Window because, when a new word is 
added to the vocabulary of acoustic word models, 
its spelling is defined by being entered into the 
Definition Window through a combination of letter 
commands and/or edit-choice commands, as will 
be described below. 

After the STARTSTRING is displayed in the 
Definition Window, step 272 performs recognition 
upon REC_TOK with the words eligible for selec- 
tion limited to the restricted active vocabulary pro- 
duced by step 266, described above. The rerecog- 
nition uses the same speech recognition algorithm 
described above with regard to step 123, except 
that the recognition is performed upon the token in 
REC_TOK rather than that in TEMP_TOK, the 
active vocabulary is restricted, and a watchdog 
routine is regularly polled during recognition so that 
the recognition will be interrupted if there is any 
user input. 

Once the recognition of step 272 is complete, 
step 274 displays the best choices from that rec- 
ognition in the active window 701 (except nothing 
new is displayed in this step if the recognition was 
interrupted by new user input), in the same num- 
bered order as the best choices produced by the 
recognition performed in step 123. The only dif- 
ference is that in step 274 (a) the active window 
701 is displayed below the definition window 703, 
rather than directly beneath the current cursor posi- 
tion of the application program, and (b) a second win- 
dow, called the dictionary window 702, is displayed 
directly below the window 701. The dictionary win- 
dow contains enough of the backup words selected 
from the backup dictionary by step 268 to bring 
the total number of choices presented in both the 
active window 701 and the dictionary window 702 
up to a total of nine words. Of course if there are 
not enough recognition choices and backup words 
selected by steps 272 and 268, respectively, to 
provide a total of nine choices in both windows, a 
lower total number of choices will be presented. 
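
The way the two windows share a total of nine choices can be sketched as follows. The function layout_choice_windows is hypothetical, and it assumes that the function key numbers run on from the active window into the dictionary window.

```python
def layout_choice_windows(recognition_choices, backup_words, total=9):
    """Illustrative split of word choices between the two windows.

    Recognition choices fill the active window first; backup words then
    fill the dictionary window until total choices are shown. Returns
    both lists and a mapping from function key name to word.
    """
    active = list(recognition_choices[:total])
    dictionary = list(backup_words[:total - len(active)])
    numbered = {f"f{i}": word
                for i, word in enumerate(active + dictionary, start=1)}
    return active, dictionary, numbered
```

If fewer than nine words are available in all, the mapping simply holds fewer keys.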

After step 274 finishes its display of word 
choices, the program again advances to the top of 
FIG. 1 to get new input from the user. 

In the alternate embodiments described above 
in which a phonemic dictionary is used, the step 
268 can be dispensed with, and the subvocabulary 
of words which start with the STARTSTRING selected by 
step 266 includes both words for which there are 
individually trained, as well as phonemic, acoustic 
word models. If this vocabulary of such words 
which start with the STARTSTRING is too large to 
be recognized with a satisfactory response time, 
the language model can be used to further limit 
this vocabulary to a number of words which can be 
recognized sufficiently promptly. In such an em- 
bodiment the rerecognition of step 272 is per- 
formed upon the restricted vocabularies of 
both individually trained and the phonemic acoustic 
word models, and step 274 displays the choices 
from both vocabularies. 

If, when the program advances to the tests in 
steps 106-110, the input detected from the user is 
an edit-choice command, test 110 is satisfied and 
the branch of the program of FIG. 1 below that step 
is executed. Edit-choice commands are commands 
which select a given one of the word choices 
displayed in the active window 701 or the dic- 
tionary window 702 and cause the spelling of that word to 
be made the STARTSTRING, just as if the user 
had entered each of the letters in that spelling 
through a sequence of letter commands. The edit- 
choice commands include the double-clicking of 
any of the function keys "f1" through "f9" cor- 
responding to a word choice displayed in either 
window 701 or 702 or the speaking of an utterance 
"edit_one" through "edit_nine" corresponding to 
the number of such a function key. 

As can be seen from FIG. 1, the steps
performed in response to an edit-choice command are
identical to the steps performed in response to a
letter command, except that step 278, which sets
the STARTSTRING equal to the letters of the
chosen word, differs from step 262.

The edit-choice commands are very useful be- 
cause when the active window 701 and the dic- 
tionary window 702 fail to display a desired word, 
they often display a word which starts, but does 
not end, with the characters of the desired word. 
The double-click function lets the user enter such a 
similar word into the definition window, where the 
user can delete undesired characters if necessary 
and then add desired letters using the letter com- 
mands. 
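
The edit-choice behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the choice list, the function names, and the one-character delete operation are hypothetical and are not taken from the specification, which describes the behavior but not an implementation.

```python
from typing import List

def edit_choice(displayed_choices: List[str], key_number: int) -> str:
    """Step 278 (sketch): the chosen word's spelling becomes the STARTSTRING."""
    return displayed_choices[key_number - 1]

def delete_character(startstring: str) -> str:
    """Hypothetical delete command: drop the last character."""
    return startstring[:-1]

def add_letter(startstring: str, letter: str) -> str:
    """Letter command (sketch): append one letter to the STARTSTRING."""
    return startstring + letter

# Invented choice list; the desired word "relates" is not among the choices.
choices = ["relate", "related", "relation"]
startstring = edit_choice(choices, 2)        # copy the similar word "related"
startstring = delete_character(startstring)  # remove the undesired "d"
startstring = add_letter(startstring, "s")   # add the desired "s"
print(startstring)
```

The point of the sketch is that copying a similar word and adjusting its ending takes fewer commands than spelling the whole word letter by letter.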

After steps 276, 278 and 264 - 274 have been 
completed in response to an edit-choice command, 
the program again returns to the top of FIG. 1 to 
get the user's next input. 

Figures 10-23 show how the preferred
embodiment operates for a sample phrase. In this exam-
ple, the speech recognition system begins with an 
empty vocabulary list. That is, it is assumed that 
the user has not yet trained up acoustic models for 
any words. The user wants to enter the phrase 
"This invention relates to", such as at the begin- 
ning of the specification of this patent application. 

Turning to FIG. 10, first the user says the word
"this". Step 111 detects this utterance and step
119 causes an acoustic representation, or token, of
the utterance to be saved in TEMP_TOK. Then,
since the program starts in TEXTMODE, step 123
calls the speech recognition algorithm described in
FIG. 4. Since the active vocabulary is initially
empty (that is, there are no words with acoustic models
against which recognition can be performed), the
recognizer produces no best choices. As is
described above, step 108 assumes the utterance of
a word for which the recognizer can find no
matches is an utterance of a text word, and thus causes
the test of that step to be satisfied in response to
such a recognition failure. This causes steps 170 -
188 to be performed. Step 170 sets the system to
TEXTMODE, which it is already in. Step 172
stores the unrecognized utterance's token in
REC_TOK. Step 174 does nothing, because there
is no prior top choice displayed, and step 176
displays the choices from the recognition. Of
course, since the choice list produced by the
recognition step 123 is empty, the Active Choice
Window 701 displayed by step 176 contains only the
option "del [reject]". This indicates that the
operator may press the "delete" key to abort the
utterance, as shown in steps 106, 190, 192, 194, and
196 of FIG. 1. After the choices are displayed, test
803 is satisfied because SAV_TOK is still empty,
and the program advances to steps 178, 180, and
188. Since there is no prior confirmed word, these
steps do nothing but return to the application
program, which in our example is a word processor.
Since step 188 provides no output to the word
processor, it does nothing in step 102, except
return to the polling loop 103 of the program of
FIG. 1 to get the next user input.

In our example, the operator responds by typing
the letter "t". This causes steps 104 and, since
test 801 is satisfied, 105 to get that keystroke and
supply it to the tests of steps 106-110. Since "t" is
a letter command, the test of step 109 is satisfied,
which causes steps 260 - 274 to be executed. Step
260 changes the system's recognition mode to
EDITMODE. Step 262 adds the letter "t" to the
previously empty STARTSTRING. Step 264 saves
the token currently in REC_TOK in SAV_TOK, so
it can be used for the training of a word model if
the user confirms a word for that token. Step 266
restricts the active vocabulary to word models
whose corresponding spellings begin with the
current STARTSTRING, "t". Of course, at this time in
our example there are no trained word models, so
this active vocabulary is empty. Then step 268
finds the list of words in the backup dictionary
which begin with the current STARTSTRING "t",
and picks the nine of those words most likely to
occur in the given language context according to
the language model. In the example the language
context buffer is currently empty, so the language
model selects from the backup vocabulary based
on the context independent word probabilities
indicated in step 184 of FIG. 5.
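
Steps 266 and 268 amount to a prefix filter followed by a ranking. The following Python sketch shows one way this could look; the function names, the dictionary contents, and the probability values are invented for illustration, and the ranking uses only context-independent probabilities, as in the empty-context case described above.

```python
from typing import Dict, List

def restrict_active_vocabulary(trained_words: List[str], startstring: str) -> List[str]:
    """Step 266 (sketch): keep trained words whose spelling begins with STARTSTRING."""
    return [word for word in trained_words if word.startswith(startstring)]

def pick_backup_words(word_probs: Dict[str, float], startstring: str, limit: int = 9) -> List[str]:
    """Step 268 (sketch): the most probable dictionary words beginning with STARTSTRING."""
    candidates = [word for word in word_probs if word.startswith(startstring)]
    # Highest probability first; ties are broken alphabetically.
    candidates.sort(key=lambda word: (-word_probs[word], word))
    return candidates[:limit]

# Hypothetical context-independent word probabilities.
backup_dictionary = {"the": 0.05, "to": 0.03, "that": 0.02, "this": 0.01,
                     "in": 0.02, "it": 0.015}

active = restrict_active_vocabulary([], "t")   # no trained word models yet
backup = pick_backup_words(backup_dictionary, "t")
print(active)
print(backup)
```

With an empty set of trained models the active list is empty, and only the backup list has anything to display, which mirrors the state shown after the first letter is typed.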

After step 268 picks nine backup words, step
270 displays the definition window 703 shown in
FIG. 11, step 272 does nothing since the active
vocabulary for rerecognition is empty, and then
step 274 displays the empty active window 701,
and all nine of the backup words picked in step
268 in the dictionary window 702. Then the
program advances back to the top of FIG. 1 to wait for
the next input from the user. At this point, the
computer display appears as shown in FIG. 11.

In FIG. 11, the correct word "this" appears in
the Dictionary Window 702, and is associated with
function key "f7". In our example it is assumed that
the user next types function key "f7", to confirm
the word "this". As a result, steps 104 and 105 get
the keystroke, and the keystroke causes the test of
step 107 to be satisfied. This causes steps 200, 202,
204, 206, 208, 214, 216, 178, 180, and 188 to be
performed. Step 200 changes the recognition mode
back to TEXTMODE, because now that the user
has confirmed a word for the last text word, it is
assumed he will utter another text word. Step 202
saves the token REC_TOK in SAV_TOK. In this
instance, this has no effect since this token has
already been saved in SAV_TOK by step 264.
Step 204 clears REC_TOK and STARTSTRING
and pop-up windows 701, 702, and 703 to prepare
for the recognition of another text word. Step 206
confirms the indicated word "this" as the word
corresponding to the last text word, that
represented by the token in SAV_TOK. Step 208 stores
the token in SAV_TOK in the tokenstore, labeled
with the confirmed word "this". Step 214 checks
the tokenstore for all tokens labeled with the word
"this", of which there is currently only the one just
stored, and then forms an acoustic word model for
that word using that token. Step 216 stores this
word model in the acoustic model store used by
the recognizer.
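
The bookkeeping of steps 208, 214, and 216 can be pictured with the Python sketch below. It is a deliberately simplified stand-in: a token is assumed to be a single fixed-length feature vector and a "model" is assumed to be the average of the tokens labeled with a word. The specification's actual model-building algorithm is not reproduced here, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Token = List[float]  # stand-in: one fixed-length feature vector per utterance

token_store: Dict[str, List[Token]] = defaultdict(list)
acoustic_models: Dict[str, Token] = {}

def store_token(word: str, token: Token) -> None:
    """Step 208 (sketch): file the token under the confirmed word."""
    token_store[word].append(token)

def build_word_model(word: str) -> Token:
    """Steps 214 and 216 (sketch): build a model from every token labeled with the word."""
    tokens = token_store[word]
    # Placeholder model: element-wise mean over all tokens for this word.
    model = [sum(column) / len(tokens) for column in zip(*tokens)]
    acoustic_models[word] = model
    return model

store_token("this", [1.0, 3.0])
first_model = build_word_model("this")    # only one token so far
store_token("this", [3.0, 5.0])
second_model = build_word_model("this")   # rebuilt from both tokens
print(first_model, second_model)
```

Keeping every labeled token, rather than only the latest model, is what lets the model be rebuilt from all available utterances each time the word is confirmed again.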

After step 216 stores the new acoustic model 
for the word "this", step 178 stores the confirmed 
word "this" as the most recent word in the lan- 
guage context buffer. Step 180 then updates the 
language model probabilities to reflect the occur- 
rence of the word "this". Then step 188 outputs 
this word to the application program as if it had 
been typed on the computer system's keyboard. 
By techniques well-known to those skilled in the 
arts of computer word processing, the dictation 
program keeps track of punctuation rules, such as 
capitalizing the first word of a sentence. Thus, the 
keystrokes sent to the application by step 188 in 
the preferred embodiment would include an upper
case "T". The user, however, types all in lower
case except for words, such as proper names, 
which are always capitalized. 

Once step 188 produces this output string, it 
returns control to the word processor application 
program at step 102. This word processor re- 
sponds to the string "This" output by step 188 by 
inserting that string into the text it is creating and 
onto the screen as indicated in FIG. 12. 

After all these steps are performed, the program
advances again to polling loop 103 to get the
next input from the user. In our example, the user
says the word "invention". In response, steps 111,
119, 121, and 123 are performed, causing an
acoustic match of the token of that utterance
against each word in the active vocabulary. Since
in our example the word "this" is the only word for
which an acoustic model has been trained, it is the
only word currently in the active vocabulary, and
thus it is supplied as the input to the tests of steps
106-110. This causes the test of step 108 to be
passed and the branch of the program below that
step to be performed. Most importantly for our
purposes, step 172 causes the token of the word
just spoken to be saved in REC_TOK and step
176 displays the active choice window 701, as
shown in FIG. 13. Since there were no word
choices displayed when the word "invention" was
spoken, step 174 does not confirm any word
choice and steps 803, 178, 180, and 188 do not do
anything except return control to the word
processing application. The application, since it has not
received any output from step 188, merely returns
to step 103 of the dictation program for more user
input.

At this point in our example the user types the
letter command "i" to indicate that the utterance to
be recognized begins with that letter. In response,
the test 109 is satisfied, STARTSTRING is set to
"i", the token in REC_TOK is saved in SAV_TOK
for potential use in making a word model, and the
active vocabulary is restricted to words starting
with "i". In the current state of the example, since
no acoustic models have been trained for words
that start with "i", the active vocabulary is thus
made empty. Step 268 then retrieves a backup
vocabulary of the nine most likely words to occur in
the current language context which start with the
letter "i". Then step 270 displays the
STARTSTRING "i" in the definition window 703 as shown
in FIG. 14. Step 272 does nothing since there is no
active vocabulary for rerecognition. Step 274
displays the resulting empty active choice window
701, and the dictionary window 702 which contains
the most likely "i" words selected by step 268.
Then the program returns to the step 103 for the
next user input. At this time the computer display
has the appearance shown in FIG. 14.

In our example it is assumed that the user next
types the letters "n", "v", "e" in successive passes
through the sequence of steps 104, 105, 109, 260,
262, 264, 266, 268, 270, 272. In the embodiment of
FIG. 1, the speech recognition algorithm performed
in step 272 is provided with a watchdog capability.
This causes the algorithm to check at a relatively
high frequency, in terms of human perception,
such as at ten times a second, for the detection of
new input from the user, either in the form of a
keystroke or an utterance. If such new input is
detected before the rerecognition of step 272 is
completed, that step will be aborted and the
program will jump immediately to polling loop 103 to
get the user's new input. This allows the user to
rapidly enter a sequence of letters or other editing
commands without having to wait for the system to
complete the relatively lengthy computation of a
rerecognition after each such entry. But this
scheme will allow the system to complete
rerecognition whenever the user pauses long enough after
the entry of the last letter for such a rerecognition
to take place. Thus the user can type a few letters,
stop and see if that selects a few words, and if not
type a few more letters or the remainder of the
word.
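
The watchdog idea can be sketched as a scoring loop that periodically asks whether new input has arrived and gives up if it has. The Python below is a minimal sketch under assumptions: the `input_pending` callback, the scoring function, and the convention that a lower score is a better match are all hypothetical, and the default interval of 0.1 second corresponds to the "ten times a second" example in the text.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

def rerecognize(
    candidates: Iterable[str],
    score: Callable[[str], float],
    input_pending: Callable[[], bool],
    check_interval: float = 0.1,
) -> Optional[List[Tuple[float, str]]]:
    """Score every candidate, but return None as soon as new user input is seen."""
    results: List[Tuple[float, str]] = []
    last_check = time.monotonic()
    for word in candidates:
        now = time.monotonic()
        if now - last_check >= check_interval:
            last_check = now
            if input_pending():
                return None  # abort; the caller goes back to polling for input
        results.append((score(word), word))
    return sorted(results)  # assumed convention: lower score is a better match

# Usage with made-up scores and no pending input.
made_up_distance = {"invention": 0.2, "invent": 0.5}.get
print(rerecognize(["invent", "invention"], made_up_distance, lambda: False))
```

Because the check is tied to elapsed time rather than to the number of candidates scored, the responsiveness to new input stays roughly constant however large the restricted vocabulary is.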

After the user has entered the characters "nve"
and the branch of the program under step 109 has
had a chance to respond to the resulting
STARTSTRING, the computer display has the appearance
shown in FIG. 15. The definition window 703
displays the current STARTSTRING "inve", and the
dictionary window 702 has a list of words beginning
with "inve", including the correct word "invention",
associated with function key "f5".

In the example, the user then presses the
function key "f5" as a pick-choice command to
select the word "invention" as the confirmed word.
As a result, the test of step 107 is satisfied and the
branch of the program under that step is executed.
Step 206 causes the word "invention" to be
confirmed. Steps 208, 214, and 216 cause the token of
the utterance associated with that word to be saved
in the token store and an acoustic model of the
word "invention" to be made using that token. Step
178 causes the word "invention" to be stored as
the most recent word in the language buffer and
the word "this" to be stored as the next to the
most recent word in that buffer. Step 180 uses the
two words in the language buffer to update the
language model probabilities, and finally step 188
causes the keystrokes " invention" to be sent to
the application. (Notice that the system has
automatically, as one of its punctuation rules, inserted a
space before the word "invention" in the
keystrokes as sent to the application.) Then, in step
102, the application program inserts this sequence
of keystrokes into the text it is creating, as
indicated in FIG. 16, and then returns to polling loop
103 for the user's next input. Notice that the
system is now in a state in which no windows are
displayed. If the user enters any keystrokes at this
point, the test in step 801 will not be satisfied and
step 802 will be executed and the keystroke will be
sent directly to the application. To the user, this
mode would be like ordinary typing.
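
Two of the punctuation rules mentioned in this walkthrough, capitalizing the first word of a sentence and inserting a space before a following word, can be sketched in Python as below. The function name and the way a sentence boundary is detected (empty output, or output ending in ".", "!" or "?") are assumptions for illustration; the specification does not spell out its rules.

```python
def format_for_output(word: str, previous_output: str) -> str:
    """Apply two simple punctuation rules to a confirmed word (sketch)."""
    starts_sentence = (
        previous_output == ""
        or previous_output.rstrip().endswith((".", "!", "?"))
    )
    if starts_sentence:
        word = word[:1].upper() + word[1:]
    # No space at the very start of the text; otherwise one space before the word.
    prefix = "" if previous_output == "" else " "
    return prefix + word

text = ""
for confirmed in ["this", "invention"]:
    text += format_for_output(confirmed, text)
print(repr(text))
```

Applying the rules at output time means the user can always spell and confirm words in lower case, leaving capitalization and spacing to the dictation program.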

Next the user says the word "relates", which
causes steps 111, 119, 121 and 123 to be
performed. At this point there are two words, "this"
and "invention" in the active vocabulary. However,
"invention" is such a poor acoustic match for
"relates" that it is not in the choice list produced
by the recognition in step 123. "This" is the best
choice produced by the recognition and it causes
the test 108 to be passed, which causes step 176
to make the active window display shown in FIG.
17. Since there were no word choices displayed at
the time the user spoke the word "relates", step
174 does not confirm any word and thus steps 803,
178, 180, 188, and 102, have no effect but to return
to step 103 to await the user's next input.

The user then types the letters "rel", in
response to each of which the branch of the program
beneath step 109 is executed. After this branch of
the program is executed for the last letter "l" in this
sequence, the computer display has the
appearance shown in FIG. 18. The then current
STARTSTRING "rel" is shown in the definition window
703. Since there are no acoustic word models
associated with spellings which start with the letters
"rel", the active choice window 701 is blank. The
dictionary window 702 contains a list of words
beginning with the letters "rel", with "relate" as the
first choice.

After the display shown in FIG. 18 is made the
program advances to polling loop 103 to wait for
the user's response. In the example, the user now
double-clicks (presses twice quickly) the function
key "f1". In response, step 105 detects the double
click and supplies a double click character
corresponding to "f1" to the tests of steps 106-110.
This causes the test of step 110 to be satisfied and
the branch of the program beneath that step to be
executed.

As a result, step 276 sets the program to
EDITMODE, the mode it was already in. Step 278
makes the word "relate" the STARTSTRING. Step
264 saves the token in REC_TOK in SAV_TOK,
which has no effect since this has previously been
done for each of the letters "rel" typed in by the
user. Step 266 leaves the already empty active
vocabulary empty. Step 268 restricts the backup
vocabulary to the two words "relate" and "related",
since the desired word "relates" is not a separate
entry in the dictionary. Step 270 displays the
STARTSTRING "relate" in the definition window
703. Step 272 does nothing since the active
vocabulary is empty. And step 274 displays the
windows 701 and 702, before the program advances
to polling loop 103 for more user input. This
sequence of steps produces the display shown in
FIG. 19.

The user then types the letter "s" to complete
the word "relates". Step 109 responds to this letter
command by causing step 262 to add this letter to
the STARTSTRING. Step 270 produces the display
shown in FIG. 20. Then the program advances
through steps 264, 266, 268, 270, 272, and 274 for
the same token, and then to polling loop 103,
where it waits for the user's response. As FIG. 20
indicates, step 270 displays the STARTSTRING
"relates" in the definition window 703, but there is
no active window 701 or dictionary window 702
displayed, because there are no words in either the
active vocabulary or in the dictionary which begin
with the letters "relates". In the preferred
embodiment illustrated in this example, the DELETE
command is only active in TEXTMODE, not in
EDITMODE, so "del[reject]" is not displayed.

Since the word in the definition window 703 in
FIG. 20 is correct, the user presses the "enter" key
(which is called the "return" key on some
keyboards). This key is interpreted by the program as
a pick-choice command for selecting the word in
the definition window as the confirmed word. In
response to this input, the test of step 107 is
satisfied, and the branch of the program beneath
that step is executed. Most importantly, step 200
returns the program to TEXTMODE, the
STARTSTRING "relates" is made the confirmed word, the
token in SAV_TOK corresponding to the utterance
of that word is saved in the tokenstore with the
label "relates" and is used to make an acoustic
model of the word "relates". The word "relates" is
stored in the language context buffer as the most
recent word and "invention" is moved to the
second most recent word. These two words are then
used to update the language model probabilities,
and "relates" is output to the word processor
application, which inserts that string into its text at the
cursor location as indicated by FIG. 21. Then the
program returns to polling loop 103 for the user's
next input. Again the system is in a state in which
the test in step 801 would not be satisfied, so the
user could do normal typing.
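
The two-word language context buffer and its use in updating the language model can be sketched as follows. This Python illustration assumes that the language model is kept as simple unigram and word-pair counts; the names and the count-based representation are hypothetical, and the specification's own probability update is not reproduced.

```python
from collections import Counter, deque

context_buffer = deque(maxlen=2)  # the two most recent confirmed words
unigram_counts = Counter()
bigram_counts = Counter()

def confirm_word(word: str) -> None:
    """Steps 178 and 180 (sketch): update the context buffer, then the counts."""
    context_buffer.append(word)   # the oldest word drops out automatically
    unigram_counts[word] += 1
    if len(context_buffer) == 2:
        bigram_counts[tuple(context_buffer)] += 1

for confirmed in ["this", "invention", "relates"]:
    confirm_word(confirmed)

print(list(context_buffer))
print(bigram_counts[("invention", "relates")])
```

After the third confirmation the buffer holds "invention" as the second most recent word and "relates" as the most recent, matching the state described in the walkthrough.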

At this point the user says the next word "to".
This causes the TEXTMODE recognition of step
123 to be performed upon the token resulting from
that utterance. The best matching word in the
active vocabulary is the word "this", so the test
of step 108 is met and "this" is displayed in the
active choice window 701.

The user then types the letter "t", which
causes the branch of the program beneath step
109 to be performed. In response, the
STARTSTRING is set to the letter "t". The active
vocabulary is restricted to words beginning with "t", a
backup vocabulary of "t" words is selected, and
rerecognition is performed on the active
vocabulary. Since at this point the active vocabulary
contains one word beginning with "t", the word "this",
"this" is selected as a best choice of the
rerecognition. After this rerecognition the computer screen
has the appearance shown in FIG. 23. The correct
word "to" is the choice associated with the function
key "f2" in the dictionary window 702. Thus the
user presses the function key "f2" to confirm the
word "to" producing the display shown in FIG. 24.

Notice that by speaking each of the four words
only once, with no explicit training or enrollment,
and using a total of only 15 keystrokes (16
keystrokes if the double-click is counted as two), the
user has entered the text "This invention relates
to" (which would normally take 25 keystrokes by
itself), and has created four acoustic models so
that when any of these words occurs again later it
can probably be entered with no keystrokes at all
(because a correct recognition can be confirmed
by beginning the next utterance, which will cause
the test of step 108 to be satisfied and the branch
of the program beneath that step to be executed).
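
The keystroke totals quoted above can be checked by tallying the keys used in the walkthrough. The short Python tally below lists the keys described for each word in the example; the dictionary layout is just a convenient way to organize the count.

```python
# Keys pressed in the walkthrough, grouped by the word they helped enter.
keystrokes = {
    "this": ["t", "f7"],
    "invention": ["i", "n", "v", "e", "f5"],
    "relates": ["r", "e", "l", "double-click f1", "s", "enter"],
    "to": ["t", "f2"],
}

total = sum(len(keys) for keys in keystrokes.values())
total_if_double_click_counts_twice = total + 1
typed_length = len("This invention relates to")

print(total, total_if_double_click_counts_twice, typed_length)
```

The tally gives 15 keystrokes (16 with the double-click counted as two) against 25 characters for typing the phrase directly.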

The user does not have to train an initial
enrollment vocabulary. To add a new word to the
vocabulary, it is only necessary to use the word in
the actual course of generating a document. With
the aid of the backup dictionary, it often takes
fewer keystrokes to add a word to the active
speech recognition vocabulary than it would take to
type the word normally.

It should be understood that if the user had
trained up acoustic models of the pick-choice,
letter and edit-choice commands used in the example
above he could have evoked those commands by
speaking them, causing basically the same result
indicated above.

Referring now to FIGS. 25 and 26, an alternate
embodiment of the present invention is shown
which is designed to recognize, and to enable
users to correct the recognition of, speech spoken
either as individual words separated by pauses or
brief phrases of several continuously spoken
words, each of which phrases is separated by a
pause. The functional steps of FIG. 25 are exactly
the same as, and have the same or corresponding
numbering as, those in FIG. 1, except for the
following: In step 123a the speech recognition algorithm
is a phrase recognition algorithm designed to
recognize an entire phrase of continuous or connected
speech. In steps 192a, 204a, 262a, 278a, 266a,
268a, 270a, the STARTSTRING involved is the
PHRASE_STARTSTRING, which lists the initial
letters of the utterance of one or more words
currently being recognized. This
PHRASE_STARTSTRING itself can contain more
than one word. The definition window 703a in
which the PHRASE_STARTSTRING is displayed
is called the phrase definition window 703a (shown
in FIG. 30). Steps 176a and 274a display the entire
recognized phrase in the phrase active window
701a (shown in FIG. 27), so that the choices
displayed can be either single words, multiple word
phrases, or a mixture of both. The steps 206a and
174a confirm the entire phrase choice selected by
the user rather than just an individual word. The
steps 208a, 214a, 216a make tokens and acoustic
word models for each word in the confirmed
phrase rather than just for a single word. Step 178a
stores all the words of the confirmed phrase in the
language context buffer, rather than just a single
confirmed word. Step 180a updates the language
model based on each of these confirmed words.
Step 188a outputs all the words of the confirmed
phrase to the application rather than just an
individual word. Finally, the greatest difference is that if
the user picks the edit-choice command and the
phrase selected for editing contains more than one
word, a new test 302, shown in FIG. 25, will cause
the program to make the jump 304 to the entirely
new phrase edit routine shown in FIG. 26.

Before we discuss FIG. 26, it should be pointed
out that, as was discussed above, a continuous
speech algorithm suitable for use in the phrase
recognition of step 123a would be very similar to the
training algorithm discussed above with regard to
FIGS. 5 and 6, except that FIG. 5 should be replaced
with FIG. 5A. The only differences between the
training algorithm of FIG. 5 and the continuous speech
algorithm of FIG. 5A are that the re-estimation of
parameters 222 is left out in the continuous speech
algorithm of FIG. 5A, and the statistics kept during
the traceback step 220 are different in the two figures.
When step 220 is used for continuous speech
recognition, the statistics to be recorded consist
merely of the sequence of words along the best
path that is being traced back. This word sequence
will become available in reverse order, but those
skilled in the arts of computing know that it is easy
to save such a word sequence and reverse it, to
display the forward word sequence to the user and
to send it to the application.
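
The remark about reversing the traced-back word sequence can be made concrete with a small Python sketch. It assumes, purely for illustration, that each point on the best path records the word that ended there and a pointer to the previous point; this data layout and the function name are hypothetical and are not taken from the figures.

```python
from typing import Dict, List, Optional, Tuple

# Each node holds (word ending at this node, id of the previous node or None).
Node = Tuple[str, Optional[int]]

def trace_back(nodes: Dict[int, Node], final_node: int) -> List[str]:
    """Follow back-pointers from the end of the best path, then reverse."""
    reversed_words: List[str] = []
    current: Optional[int] = final_node
    while current is not None:
        word, previous = nodes[current]
        reversed_words.append(word)  # words arrive last-to-first
        current = previous
    reversed_words.reverse()         # restore the spoken order
    return reversed_words

best_path = {0: ("I", None), 1: ("knew", 0), 2: ("this", 1), 3: ("play", 2)}
print(trace_back(best_path, 3))
```

Collecting the words in a list and reversing once at the end keeps the traceback itself a simple single pass over the back-pointers.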

It should also be noted, with regard to FIG. 25,
that steps 266a and 268a restrict the active
vocabulary and the backup vocabulary to words
starting with the PHRASE_STARTSTRING, and if that
string contains any spaces, step 266a will only
select entries from the acoustic word model store
and step 268a will only select entries from the
backup dictionary which contain an identical string
having the same spaces. Such multiword entries
will normally be quite rare, causing the active
vocabulary and the backup vocabulary chosen by
these two steps normally to be empty if the
PHRASE_STARTSTRING contains more than one
word. The recognition of step 272a is single word
recognition, just as in FIG. 1, and it can only
recognize a multiword phrase if an individual
acoustic model has been trained for that phrase as
if it were one word.
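
The effect of a space in the PHRASE_STARTSTRING can be illustrated with a prefix filter that treats spaces like any other character. In this Python sketch the vocabulary entries are invented, "new york" stands in for a rare multiword entry, and the function name is hypothetical.

```python
from typing import List

def restrict_to_startstring(entries: List[str], phrase_startstring: str) -> List[str]:
    """Steps 266a/268a (sketch): keep entries that begin with the string, spaces included."""
    return [entry for entry in entries if entry.startswith(phrase_startstring)]

# Hypothetical vocabulary containing one multiword entry.
entries = ["knew", "new", "new york", "newt"]

single_word = restrict_to_startstring(entries, "new")
with_space = restrict_to_startstring(entries, "new y")
no_match = restrict_to_startstring(entries, "i knew")
print(single_word, with_space, no_match)
```

Once the string contains a space, only multiword entries can match, so in a vocabulary of mostly single words the restricted lists are usually empty.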

The embodiment of the invention illustrated in
FIG. 25 operates almost exactly like that of FIG. 1
when the user speaks discrete utterances, with the
possible exception that some of the choices
displayed in active window 701a may contain multiple
words.

If, however, the user speaks a multiword
phrase, and in response to the first recognition of
that utterance the correct phrase is displayed as a
choice in the phrase active window 701a, the user
can confirm that choice in the same manner as
when a correct choice is displayed by the
embodiment of the invention shown in FIG. 1. That is, the
user can select such a choice by pressing the
function key or saying the pick-choice command
associated with the correct choice, or by uttering a
new utterance which is recognized as something
other than a delete, pick-choice, letter, or
edit-choice command.

For example, if the user had said "I knew this
play" and was presented with the choice display
shown in FIG. 27, he could confirm that phrase
merely by pressing the "f3" key, or saying
"pick_three". If this was done, the "f3" keystroke
would be detected and gotten by steps 104 and
105, and that keystroke would cause the test of
step 107 to be satisfied. Step 200 would set the
program to TEXTMODE. Step 202 would save the
token of the utterance "I knew this play" stored in
REC_TOK in SAV_TOK for use in making word
models. Step 204a would clear REC_TOK and all
startstrings, including PHRASE_STARTSTRING,
for use in the next recognition. Step 204a would
also clear pop-up window 701a. Then step 206a
would confirm the indicated choice associated with
the "f3" key, the choice "I knew this play". Step
208a would store in the tokenstore each portion of
the token in SAV_TOK which is time aligned
against the acoustic model of each word in the
confirmed string as a separate token, would
label each such token with the individual word
against which it was time aligned, and would then
clear SAV_TOK. In the preferred embodiment of FIG.
25, this time alignment is recalculated for step
208a by using the phrase recognition algorithm
described with regard to FIG. 5A and FIG. 6 to time
align the sequence of acoustic models associated
with the confirmed words against the entire token
in SAV_TOK, unless such a time alignment has
been stored for the token in SAV_TOK by the
steps in FIG. 26.
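
Once a time alignment is available, the splitting performed by step 208a is essentially slicing. The Python sketch below assumes that a token is a list of acoustic frames and that the alignment is already given as frame ranges per word; the frame values, the boundaries, and the names are invented for illustration, and computing the alignment itself is outside the sketch.

```python
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

Frame = List[float]                          # one acoustic frame (stand-in)
Alignment = Sequence[Tuple[str, int, int]]   # (word, first frame, one past last frame)

def split_phrase_token(
    frames: Sequence[Frame], alignment: Alignment
) -> DefaultDict[str, List[List[Frame]]]:
    """Step 208a (sketch): cut the phrase token into one labeled token per word."""
    token_store: DefaultDict[str, List[List[Frame]]] = defaultdict(list)
    for word, start, end in alignment:
        token_store[word].append(list(frames[start:end]))
    return token_store

phrase_token = [[float(i)] for i in range(8)]   # eight made-up frames
alignment = [("I", 0, 1), ("knew", 1, 3), ("this", 3, 5), ("play", 5, 8)]
store = split_phrase_token(phrase_token, alignment)
print(store["knew"])
```

Each per-word slice can then be treated exactly like a token from a separately spoken word when word models are built.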

Once step 208a has stored in the tokenstore an
acoustic token for each of the words in the
confirmed phrase, step 214a builds an acoustic word
model for each of the words in the confirmed
phrase, using each of the one or more acoustic
tokens stored in the tokenstore with that word as a
label. Step 216a stores each such newly built word
model in the acoustic word model store for future
use by the speech recognition system. After this is
done, step 178a stores each of the words in the
confirmed phrase in order in the language context
buffer, which in the embodiment of FIG. 25 is
designed to hold as many words as can be said in
the size phrase the system is designed to hold,
plus one previous word. Then step 180a uses each
of the words in the confirmed phrase and the word
that precedes it in that buffer to update the
language model shown in FIG. 9. Finally step 188a
outputs all the words in the confirmed phrase to the
application program as if they had been typed into
that application on the keyboard and returns to the
application program for it to process those
keystrokes. Then the application program returns to
polling loop 103 at the top of FIG. 25 for the next
input from the user.

If, on the other hand, when the user speaks a
multiword phrase the system fails to correctly
recognize it, he has two major ways to get the correct
phrase entered with the embodiment shown in
FIGS. 25 and 26. The first is to enter all the
characters of the phrase with letter commands,
either by typing them or by speaking them with the
communications alphabet (with any spaces in the
desired string being indicated by the word
"space"), followed by confirming the resulting
PHRASE_STARTSTRING by either pressing the
"enter" key or speaking the word "enter_that".
When the user types in such a
PHRASE_STARTSTRING, the system of FIG. 25
responds in almost exactly the same manner as
when the user types in a STARTSTRING in the
embodiment of FIG. 1. There are only two major
differences: First, once the user types a space in
the PHRASE_STARTSTRING both the active
vocabulary produced by step 266a and the backup
vocabulary produced by step 268a will be empty,
unless there are acoustic word models or
dictionary word entries with spaces in them. Second,
when the user confirms a multiword phrase, the
system may not have enough information to
accurately time align each word of the confirmed
phrase against SAV_TOK.

This second difference is a little more difficult
to understand, so let us explain it in greater detail.
When the user has spelled the entire desired string
and confirms it by speaking "enter_that" or
pressing the "enter" key, the branch of the program
beneath test 107 in FIG. 25 confirms this utterance.
This branch will respond to the confirmed phrase in
the manner described above in which it responded
to the confirmation of the phrase "I knew this play".
The only difference is that if the step 208a had no
acoustic models for one or more of the words in
the confirmed multiword phrase, it might not be
able to accurately time align each of those words
against the token in SAV_TOK, and thus might not
be able to make a token and a model for each
such word. In the embodiment of FIG. 25 this can
be a problem whenever the confirmed phrase
contains one or more words for which there is no
acoustic model. In alternate embodiments of the
invention where phonetic models are provided for
each word in a large phonetic dictionary, this will
only be a problem for words for which the phonetic
dictionary provides no models. When this inability
to properly time align all the words in the
confirmed phrase occurs, however, its only effect is to
prevent tokens and acoustic models for the words
which cannot be properly time aligned from being
made. If the user wants to train a model of any
such word he can still do so by speaking it as a
separate word, in which case step 208a will have
no problem time aligning that word against
SAV_TOK, and thus steps 208a and 214a will
have no problem making a token and a model of it.

The second method of correcting an incorrectly
recognized multiword phrase is to use the phrase
edit routine of FIG. 26. This routine is called when
the user issues an edit-choice command to select a
multiword phrase displayed in the phrase active
window 701a. As was stated above, such a
command satisfies the test in step 110 of FIG. 25,
causing step 276 to switch the system to
EDITMODE, step 278a to make the chosen multiword
phrase the PHRASE_STARTSTRING, and the test
in step 302 to be satisfied, which jumps the
program to the location J1 shown in FIG. 26, the start
of the phrase edit routine.

Referring now to FIG. 26, the phrase edit routine
shown there is designed to enable the user to
select one of the individual words shown in the
phrase definition window 703a which is incorrect,
and to correct that word in the same basic manner
that the program of FIG. 1 lets a user correct a
given word shown in its definition window. As a
result, the flow chart of FIG. 26 is in many ways
similar to that of FIG. 1.

Referring now to FIG. 26, after the program
makes the jump to J1, step 402 clears two strings,
the ALREADY_CONFIRMED_STRING and the
WORD_STARTSTRING. In the phrase edit mode
the user is enabled to edit the phrase in the phrase
definition window 703a in a left to right manner.
The ALREADY_CONFIRMED_STRING stores
those words from the phrase definition window
which have already been confirmed by the user at any
given time. When step 402 clears this string it
indicates that at that time none of the words in the
phrase definition window have been confirmed by
the user. The WORD_STARTSTRING is a startstring,
very much like that used in the embodiment
of the invention in FIG. 1, which stores the one or
more letters of a desired word which is to replace
the word of the PHRASE_STARTSTRING which
has been selected for correction. In step 402 the
WORD_STARTSTRING is cleared to indicate that
at this time no WORD_STARTSTRING has been
specified by the user.

After step 402, the program advances to step
404, which forms a part of the phrase edit routine
which is repeated each time the user changes a
word in the PHRASE_STARTSTRING. Step 404
displays the current PHRASE_STARTSTRING in
the phrase definition window 703a, preceded by
the ALREADY_CONFIRMED_STRING, which is
shown with underlining. When the phrase edit
mode is first entered, the
PHRASE_STARTSTRING is the multiword phrase
selected by the edit-choice command which caused
the test of step 110 of FIG. 25 to be satisfied, and
the ALREADY_CONFIRMED_STRING is empty,
so the phrase definition window merely displays
the PHRASE_STARTSTRING selected by the
edit-choice command.
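
The three strings described above can be pictured as one small record. The following is only an illustrative sketch in Python, not the program of FIG. 26: the class name `PhraseEditState`, the helper names, the `_word_` marking used for underlining, and the `[word]` marking used for reverse video are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhraseEditState:
    """Hypothetical container for the strings used by the phrase edit routine."""
    phrase_startstring: List[str]                                # words still open to editing
    already_confirmed: List[str] = field(default_factory=list)  # words confirmed so far
    word_startstring: str = ""                                   # letters of a replacement word
    selected: int = 0                                            # index into phrase_startstring


def enter_phrase_edit(chosen_phrase: str) -> PhraseEditState:
    # Step 402: ALREADY_CONFIRMED_STRING and WORD_STARTSTRING both start empty.
    # Step 406: the first word of PHRASE_STARTSTRING becomes the selected word.
    return PhraseEditState(phrase_startstring=chosen_phrase.split())


def render_definition_window(state: PhraseEditState) -> str:
    # Step 404: confirmed words come first (underlining shown as _word_),
    # followed by the editable words; the selected word is shown in
    # [brackets] as a stand-in for reverse video.
    confirmed = [f"_{w}_" for w in state.already_confirmed]
    editable = [
        f"[{w}]" if i == state.selected else w
        for i, w in enumerate(state.phrase_startstring)
    ]
    return " ".join(confirmed + editable)
```

Under these assumptions, entering the phrase edit mode with "a nudist play" renders as `[a] nudist play`, since nothing has been confirmed and the first word is selected.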

After step 404, step 406 makes the first word
of the PHRASE_STARTSTRING the selected
word, which includes showing it in reverse video in
the phrase definition window. Then step 408 performs
single word recognition against the token in
REC_TOK. It starts this recognition at the part of
that token which is time aligned against the start of
the selected word. In a manner similar to that
indicated above with regard to step 208a of FIG.
25, such time alignment can be obtained by performing
the recognition algorithm of FIGS. 5A and 6
against that token using only the sequence of word
models associated with the current
ALREADY_CONFIRMED_STRING followed by
the PHRASE_STARTSTRING. The single word
recognition of step 408 is like the discrete utterance
recognition used in the embodiment of FIG. 1,
except that instead of using the silence scores used
in the routine shown in FIG. 3, the current word is
recognized embedded in a connected phrase. Thus instead of
the initialization shown in step 151 of FIG. 4, the
first node of each word model is seeded each
frame by the score from the last node of the
previous word of the connected phrase, as previously
computed, in a manner similar to that
shown in FIGS. 5A and 6. The observation probability
obs_prob(label(WORD, LAST_NODE), FRAME)
computed in step 163 would be a special score
representing the average score for other speech
following the current word, rather than a silence or
background noise score. Thus the match routine
would compute the match for each word given the
context of the preceding words, if any, of the
connected phrase which the user has already confirmed,
but leaving the right context unspecified.
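
The seeding of the first node can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions, not the algorithm of FIGS. 3 through 6: scores are treated as costs (lower is better), each node may only repeat or advance to the next node, and the special right-context score is omitted. The names `score_embedded_word`, `node_costs`, and `entry_scores` are hypothetical.

```python
import math
from typing import List


def score_embedded_word(
    node_costs: List[List[float]],
    entry_scores: List[float],
) -> List[float]:
    """Return, for each frame, the best cost of a path ending in the last node.

    node_costs[t][n] is the assumed cost of matching frame t to node n.
    entry_scores[t] is the assumed cost of the confirmed left context
    (the last node of the previous word) available at frame t.
    """
    n_nodes = len(node_costs[0])
    prev = [math.inf] * n_nodes
    end_scores = []
    for t, costs in enumerate(node_costs):
        cur = [math.inf] * n_nodes
        for n in range(n_nodes):
            stay = prev[n]
            # The first node is re-seeded every frame from the previous word,
            # instead of being initialised once from a silence score.
            advance = prev[n - 1] if n > 0 else entry_scores[t]
            cur[n] = min(stay, advance) + costs[n]
        end_scores.append(cur[-1])
        prev = cur
    return end_scores
```

Because the entry score is offered at every frame, the word may begin wherever the left context ends most cheaply, which is what lets the selected word be scored in the context of the words already confirmed.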

The recognition routine in step 408 not only
picks a best scoring word, but also a list of next
best scoring alternate words to be placed on a
score-ordered choice list that will be displayed in
a word active window 701b. In an alternate embodiment,
the scores of all these words could be saved
as the continuous speech match routine of step
123a of FIG. 25 does its forward scoring, as shown
in FIG. 6. In the preferred embodiment, the match
routine in step 408 is called instead because it can
also be used to recompute the word scores after a
vocabulary restriction or augmentation, as will be
done in step 409, described below.

After the match routine of step 408 returns a
choice list for the selected word position, step 410
displays this choice list in a word active window
701b, as shown in FIG. 30, and the program advances
to polling loop 103b at the top of FIG. 26
for the operator to select an option by providing
input. The steps 103b, 104b, 105b, 111b, 119b,
and 122b at the top of FIG. 26 are identical to the
correspondingly numbered steps at the tops of
FIGS. 1 and 25. They wait for, get, and in the case
of an utterance, recognize, the user's input, and
supply the input to a series of tests. This series of
tests includes the "if delete command" test 106b,
the "if pick-choice command" test 107b, the "if
letter command" test 109b, and the "if edit-word
command" test 110b, which correspond to the similarly
numbered tests in FIGS. 1 and 25. It also includes
two new tests, the "if word-selection command"
test 412 and the "if enter-phrase command" test
414.

If, when the program advances to polling loop
103b, the user enters a word-selection command,
the test 412 will be satisfied. In the embodiment of
FIG. 26 the word-selection commands include the
pressing of the left and right cursor keys, which
select the word to the left or right of the current
selected word, provided there is such a word in the
PHRASE_STARTSTRING. The word-selection
commands also include the spoken commands
"move_left_one", "move_left_two", etc. and
"move_right_one", "move_right_two", etc.,
which select the word the specified number of
words to the left or right of the currently selected
word, provided there is such a word in the
PHRASE_STARTSTRING.
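
The word-selection commands amount to a signed offset applied to the index of the selected word, ignored when no word exists at the target position. A minimal sketch follows, assuming hypothetical command tokens `left_cursor` and `right_cursor` for the cursor keys and a hypothetical function name `apply_word_selection`; only the spoken command spellings come from the text.

```python
import re

# Number words accepted after "move_left_" / "move_right_" (assumed range).
_NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def apply_word_selection(command: str, selected: int, phrase_len: int) -> int:
    """Return the new selected index, or the old one if no such word exists."""
    if command == "left_cursor":
        offset = -1
    elif command == "right_cursor":
        offset = 1
    else:
        match = re.fullmatch(r"move_(left|right)_([a-z]+)", command)
        if match is None or match.group(2) not in _NUMBER_WORDS:
            return selected  # not a word-selection command
        count = _NUMBER_WORDS[match.group(2)]
        offset = -count if match.group(1) == "left" else count
    target = selected + offset
    # The selection only moves if there is a word at the target position.
    return target if 0 <= target < phrase_len else selected
```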

If the test of step 412 is passed, the branch of
the program beneath the step is performed. Step
416 selects the word indicated by the word-selection
command, causing it and not the previous
selected word to be shown in reverse video in the
phrase definition window. Then steps 408 and 410,
described above, are performed and the program
advances to polling loop 103b for the next user
input. It should be noted that the speech recognition
of step 408 is performed using a watchdog
routine that terminates that recognition and jumps
to loop 103b any time it senses new input from the
user. This lets the user rapidly issue successive
word-selection commands, without having to wait
for the completion of word recognition before each
such successive command is responded to.
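
One way to picture the watchdog is a recognition loop that checks for pending input between units of work and abandons the recognition if any is found. The sketch below is an assumption about how such a check could be structured, not the patent's routine; `recognize_with_watchdog`, `score_fn`, and the use of a `queue.Queue` for user input are all hypothetical, and scores are treated as costs (lower is better).

```python
import queue
from typing import Callable, List, Optional


def recognize_with_watchdog(
    candidates: List[str],
    score_fn: Callable[[str], float],
    input_queue: "queue.Queue[str]",
) -> Optional[List[str]]:
    """Score candidates one at a time; return None if new input arrives first."""
    scored = []
    for word in candidates:
        if not input_queue.empty():
            # Watchdog fired: give up so the caller can return to the
            # polling loop and handle the new input immediately.
            return None
        scored.append((score_fn(word), word))
    # Best (lowest cost) first, forming the ordered choice list.
    return [word for _, word in sorted(scored)]
```

A caller receiving `None` would simply go back to its polling loop, so a burst of successive commands never waits on a recognition whose result is already stale.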

If, when the program advances to polling loop
103b, the user enters a pick-choice command, the
test 107b is satisfied and the branch of the program
below that step is performed. In the routine of
FIG. 26 the pick-choice commands include the
pressing of any of the keys "f1" through "f9" and
any of the spoken commands "pick_one" through
"pick_nine" corresponding to any of the word
choices displayed in the word active window 701b
or the word dictionary window 702b. In the phrase
edit routine the pick-choice commands do not include
the pressing of the "enter" key or the speaking
of the word "enter_that", since in this routine
these "enter" commands are used to confirm the
entire phrase in the phrase definition window, as is
described below.

Once the test of step 107b is satisfied, step
418 substitutes the word picked by the pick-choice
command for the current selected word shown in
reverse video in the phrase definition window. For
purposes of simplification, the phrase edit routine
of FIG. 26 only lets the user correct words to the
right of the last word of the PHRASE_STARTSTRING
which has already been corrected. Thus once a
word has been corrected by step 418, step 420
adds the corrected word and all the preceding
words in the current PHRASE_STARTSTRING to
the ALREADY_CONFIRMED_STRING, so that
those words will no longer be in the
PHRASE_STARTSTRING for further editing. After
this has been done, step 422 redisplays the phrase
definition window 703a, with all the words in the
ALREADY_CONFIRMED_STRING underlined, followed
by a space, shown in reverse video, for the
new PHRASE_STARTSTRING to be rerecognized
in step 424. Then step 424 performs phrase recognition
against REC_TOK, starting with the
portion of that token time aligned against the previous
selected word in the recognition that produced
that previous selected word. The new selected
word, that chosen by the most recent pick-choice
command, is used as the first word of all
candidate phrases used in the phrase recognition
of step 424. This re-recognition is performed so
that when an error in one word of a connected
phrase causes the next word to be misrecognized,
the match routine can automatically correct the
second error once the operator has corrected the
first error. This method also allows for the possibility
that the number of words in the output sequence
may be different than the correct number
of words. For example, if one word is mistakenly
recognized as a pair of words, the operator merely
corrects the first incorrect word. Once the correct
(longer) word has been substituted for the first
wrong word, the re-recognition will continue from
the end of the corrected longer word.

Once step 424 has performed this phrase recognition,
and selected the best scoring word sequence
starting with the current selected word,
step 426 subtracts the current selected word from
that word sequence and makes the remaining
words of the sequence the new
PHRASE_STARTSTRING. Then the program advances
to step 404, which displays the new
PHRASE_STARTSTRING in the phrase definition
window preceded by the
ALREADY_CONFIRMED_STRING, which is
shown underlined to distinguish it from the
PHRASE_STARTSTRING. Then step 406 selects
the first word of the new PHRASE_STARTSTRING
as the selected word and shows it in reverse video,
step 408 performs single word recognition on the
portion of REC_TOK corresponding to the selected
word, and step 410 displays the word choices
from this recognition. Then the program advances
to polling loop 103b for the next user input.
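
The bookkeeping of steps 418 through 426, and of step 430 when the phrase is finally entered, can be sketched on lists of words. This is an illustration only: the acoustic phrase recognition of step 424 is replaced by a caller-supplied function, and the names `handle_pick_choice`, `confirm_phrase`, and `rerecognize` are hypothetical.

```python
from typing import Callable, List, Tuple


def handle_pick_choice(
    confirmed: List[str],
    phrase: List[str],
    selected: int,
    picked: str,
    rerecognize: Callable[[str], List[str]],
) -> Tuple[List[str], List[str]]:
    """Return the new (ALREADY_CONFIRMED_STRING, PHRASE_STARTSTRING) as word lists."""
    # Steps 418 and 420: the picked word replaces the selected word, and it
    # and every word before it are moved to the confirmed string.
    new_confirmed = confirmed + phrase[:selected] + [picked]
    # Step 424: phrase recognition restricted to sequences that start with
    # the picked word (stand-in for the acoustic match against REC_TOK).
    best_sequence = rerecognize(picked)
    if not best_sequence or best_sequence[0] != picked:
        raise ValueError("re-recognition must start with the picked word")
    # Step 426: drop the picked word; the rest is the new PHRASE_STARTSTRING.
    return new_confirmed, best_sequence[1:]


def confirm_phrase(confirmed: List[str], phrase: List[str]) -> List[str]:
    # Step 430: the confirmed words are placed in front of the remaining words.
    return confirmed + phrase
```

With "a nudist play" as the phrase, "nudist" selected, "new" picked, and a stand-in recognizer that returns "new display", the confirmed string becomes "a new" and the new PHRASE_STARTSTRING becomes "display".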

If, when the program advances to polling loop
103b, the user enters an enter-phrase command
such as the pressing of the "enter" key or the
speaking of the word "enter_that", the test of step
414 will be satisfied and steps 430 and 432 are
executed. The enter-phrase commands are used to
indicate that the user desires to confirm the entire
phrase currently in the phrase definition window.
Step 430 adds the
ALREADY_CONFIRMED_STRING to the start of
the current PHRASE_STARTSTRING to form the
PHRASE_STARTSTRING to be confirmed. Then
step 432 jumps to J2 in FIG. 25 with a pick-choice
command which causes the branch of the program
under step 107 of FIG. 25 to confirm the entire
PHRASE_STARTSTRING just as if it were a multiword
phrase from the phrase active window 701a
which had been confirmed by the pressing of a
function key.

If, when the program advances to polling loop
103b, the user enters a delete command of the
type described above, the test of step 106b is met
and the program advances to step 434, which
jumps to location J3 on FIG. 25. This causes the
program to abort recognition of the entire token
represented by REC_TOK, just as if the user had
entered that delete command before ever entering
the phrase edit routine of FIG. 26.

If, when the program advances to polling loop
103b, the user enters a letter command, to start,
add to, or edit the WORD_STARTSTRING, the
test of step 109b is met. The letter commands
recognized in the phrase edit routine of FIG. 26 are
the same as those recognized in the EDIT-MODE
in the program of FIG. 1, described above.
After the test of step 109b has been met, step 436
adds the letter, if any, indicated by the letter command
to the WORD_STARTSTRING. If, on the other hand,
the letter command is the command for deleting
the last character of the WORD_STARTSTRING,
that last character, if any, is deleted from that
string. Then step 438 restricts the active vocabulary
to acoustic word models whose corresponding
words start with the WORD_STARTSTRING. Next
step 440 restricts the backup vocabulary to up to
nine words from the backup dictionary which start
with the WORD_STARTSTRING and which are
selected by the language model filter as most likely
given the word which immediately precedes the
selected word. Then step 442 displays the
WORD_STARTSTRING in a word definition window
705 (shown in FIG. 35), which is shown in
reverse video in the part of the
phrase definition window where the selected
word formerly was located.
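
Steps 436 through 440 are essentially string and prefix operations, which the following sketch illustrates. It is a simplified picture under stated assumptions: the command token `delete_last`, the function names, and the `lm_score(previous_word, word)` callable standing in for the language model filter are all hypothetical.

```python
from typing import Callable, List


def apply_letter_command(word_startstring: str, command: str) -> str:
    # Step 436: either delete the last character (if any) or append a letter.
    if command == "delete_last":  # hypothetical name for the delete command
        return word_startstring[:-1]
    return word_startstring + command


def restrict_active_vocabulary(active_vocab: List[str], startstring: str) -> List[str]:
    # Step 438: keep only words whose spelling starts with the startstring.
    return [w for w in active_vocab if w.startswith(startstring)]


def pick_backup_vocabulary(
    backup_dictionary: List[str],
    startstring: str,
    lm_score: Callable[[str, str], float],
    previous_word: str,
    limit: int = 9,
) -> List[str]:
    # Step 440: up to nine matching backup words, most likely first given
    # the word immediately preceding the selected word.
    matches = [w for w in backup_dictionary if w.startswith(startstring)]
    matches.sort(key=lambda w: lm_score(previous_word, w), reverse=True)
    return matches[:limit]
```

Each further letter narrows both lists, which is why a word can usually be reached after only one or two keystrokes.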

Once this is done, step 409 performs single
word rerecognition upon REC_TOK starting at the
portion of that token time aligned with the start of
the selected word. This recognition uses the same
algorithm as is used in the recognition of step 408,
except that it uses the restricted vocabulary produced
by step 438. This recognition also uses a
watchdog routine which jumps to polling loop 103b
if the user enters any input before the recognition
is complete, so that, if the user rapidly enters a
sequence of letter commands, the system does not
have to complete a separate rerecognition after the
entry of each before responding to the next.

After step 409 selects one or more best scoring
rerecognition words, step 446 displays those
choices in rank order, each with a function key
number next to it, in the word active window 701b
below the word definition window 705. Immediately
below the word active window, step 446 also displays
the backup words in a word dictionary window
702b. If there are enough words in the backup
vocabulary selected by step 440 to do so, step 446
displays enough of these words to bring the total
number of words displayed in both windows 701b
and 702b to 9, so that each of the function keys
"f1" through "f9" will have a word choice displayed
next to it. Once step 409 is performed the program
advances to polling loop 103b for more user input.
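
The filling of the two windows in step 446 can be sketched as follows. The function name `layout_choice_windows` and the dictionary used to map function keys to words are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def layout_choice_windows(
    active_choices: List[str],
    backup_words: List[str],
    total_keys: int = 9,
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Split choices between windows 701b and 702b and label them f1..f9."""
    # Rerecognition choices go first, in rank order (window 701b).
    active = list(active_choices[:total_keys])
    # Backup words fill whatever function keys remain (window 702b).
    backup = list(backup_words[: total_keys - len(active)])
    keys = {f"f{i}": word for i, word in enumerate(active + backup, start=1)}
    return active, backup, keys
```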

If, when the program advances to polling loop
103b, the user enters an edit-word command, to
select one of the words displayed next to a function
key in the windows 701b or 702b for insertion
into the WORD_STARTSTRING for editing, the
test of step 110b is met. The edit-word commands
are the same as the edit-choice commands described
above with regard to FIG. 1. They include
the double-clicking of any function key "f1"
through "f9", and the saying of any command
"edit_one" through "edit_nine", associated with
one of the word choices displayed in the windows
701b or 702b.

When such an edit-word command is detected
and the test of step 110b is met, step 448 makes
the chosen word indicated by the edit-word command
the WORD_STARTSTRING. Then the sequence
of steps 438 through 446 described above
is performed, just as if the user had just finished
entering all of the letters in the spelling of the
chosen word into the WORD_STARTSTRING by a
sequence of letter commands.

Referring now to FIGS. 29-36, the operation of
the phrase edit routine will be further demonstrated.

The example of these figures assumes the
user starts by saying the phrase "A new display".
In response, the example assumes step 123a of
FIG. 25 recognizes this phrase, step 108 detects
that step 123a recognized a new utterance, and
step 176a displays the results of this recognition in
the phrase active window 701a, as indicated in FIG.
29, and then goes back to polling loop 103 for the
user's response.

The user recognizes that the closest response
is either that associated with "f1" or "f2", so he
double-clicks "f1" to indicate that its associated
phrase is to be edited. In response the test of step
110 of FIG. 25 is met, the selected phrase "a
nudist play" is made the
PHRASE_STARTSTRING, and the test of step
302 causes the program to jump to J1 on FIG. 26,
the start of the phrase edit routine. Step 404 then
displays the PHRASE_STARTSTRING in the
phrase definition window 703a. Step 406 makes the
first word "a" of the PHRASE_STARTSTRING the
selected word and shows it in reverse video in the
phrase definition window. Then step 408 performs
single word recognition on REC_TOK, starting at
the start of that token, since the first word "a" is
the selected word. Step 410 displays the choices
from the recognition in the word active window
701b, and the program advances to 103b to await
the user's response. This leaves the display looking
like FIG. 30.

The user presses the right cursor key to select
the second word of the PHRASE_STARTSTRING,
the word "nudist", as the first word to be corrected.
This causes the test of step 412 to be met. As a
result step 416 makes the word "nudist" the selected
word and shows it in reverse video. Step 408
then performs single word recognition upon
REC_TOK starting with the part of that token
aligned against the selected word "nudist". After
this recognition is performed step 410 displays the
choices from the recognition in the word active
window 701b. This causes the display to have the
appearance shown in FIG. 31.

When the program next waits for the user's
input, the user presses the "f2" key to confirm the
word "new" to replace the former selected word
"nudist". Step 107b responds to this pick-choice
command by causing the branch of the program
beneath it to be executed. Step 418 substitutes
"new" for "nudist" in the
PHRASE_STARTSTRING. Step 420 adds all the
words of the PHRASE_STARTSTRING up to and including
this new word, the words "a new", to the
ALREADY_CONFIRMED_STRING. Step 422 redisplays
the phrase definition window with the
ALREADY_CONFIRMED_STRING shown underlined
and the rest of the window shown in reverse
video, as indicated in FIG. 32.

Then step 424 performs phrase recognition
against REC_TOK starting with the part of the
token formerly time aligned against the word
"nudist". It performs this recognition with the possible
sequence of word choices all limited to sequences
starting with the just picked word "new".
When this recognition has selected a best scoring
word sequence, which in our example is the sequence
"new display", step 426 subtracts the picked
word "new" from the start of it and makes the
one or more remaining words of the sequence, in
this case the single word "display", the new
PHRASE_STARTSTRING. Once this is done the
program advances to step 404 which redisplays the
phrase definition window with the
ALREADY_CONFIRMED_STRING underlined, followed
by the new PHRASE_STARTSTRING. Then
step 406 selects the first word of the new
PHRASE_STARTSTRING and displays it in reverse video
in the phrase definition window. Next step 408
performs single word recognition on REC_TOK
starting on the portion of that token time aligned
against the start of "display" in the last recognition
of step 424. Then step 410 displays the best
scoring word choices from that recognition in the
window 701b before returning to polling loop 103b
for the next input. This causes the display shown in
FIG. 33.

In response the user presses the "enter" key.
This causes the test of step 414 to be met. Then
step 430 adds the
ALREADY_CONFIRMED_STRING, "a new", to
the start of the PHRASE_STARTSTRING,
"display", to form the PHRASE_STARTSTRING,
"a new display". Then step 432 jumps back to J2
on FIG. 25, causing the PHRASE_STARTSTRING
to be treated as a confirmed string, in the manner
described before. After this is done each of the
words of this phrase is used to make an acoustic
model and update the language model, and is output to
the application program, causing the display to
appear as in FIG. 34.

Referring now to FIG. 35, if, in our example,
the user, after seeing the display shown in FIG. 31,
decided to correct the selected word "nudist" by
entering the letter command "n", a different sequence
of events would have resulted. Although in
the example of FIG. 31 the correct word "new" was
displayed, it will often be the case that it is not,
and the user may need to use the letter commands
or the edit-word commands to enter the desired
word.

If the user presses the "n" key after seeing the
display of FIG. 31, the test of step 109b will be
met. Step 436 will add the letter "n" to the end of
the formerly empty WORD_STARTSTRING. Step
438 will restrict the active vocabulary to words
starting with the letter "n". Step 440 will pick a
backup vocabulary of words starting with that letter.
Step 442 will display the WORD_STARTSTRING
"n" in the word definition window 705 which pops
up in the phrase definition window in place of the
selected word. Step 409 performs rerecognition on
the portion of REC_TOK corresponding to the
selected word using the restricted active vocabulary.
And step 446 displays the word choices resulting
from the rerecognition along with backup
words in the windows 701b and 702b. This produces
the display shown in FIG. 35.

In our example it is assumed the user types
the letter "e", even though the desired word "new"
is already shown as a selectable word choice. In
response to the typing of this letter the steps 109b,
436, 438, 440, 442, 409, and 446 are all repeated
over again, producing a WORD_STARTSTRING of
"ne", and a group of choices in the word active
window 701b and word dictionary window 702b
which all start with that string, as shown in FIG. 36.
At this point the user presses the "f1" key to
confirm the word choice "new". Then the branch of
the program under step 107b responds to this
choice in the exact same manner described above
with regard to FIGS. 32 and 33.

It should be understood from the description of
the branch of the program under step 110b that the
user can also make any one of the word choices
displayed in a word active window 701b or a word
dictionary window 702b the
WORD_STARTSTRING merely by double-clicking
the function key associated with that word choice
or saying the corresponding spoken edit-word command.
This will have the same results as if he or
she had typed all of the letters of that word into the
WORD_STARTSTRING.

It can be seen that the present invention provides
a speech recognition system which improves
the speed and accuracy with which users can
dictate desired text. It does this by making it easier
to correct errors and enter new words into the
vocabulary. The invention provides an interactive
speech recognition system in which the user can
begin productive work on large vocabulary, natural
language dictation tasks with little or no advance
training of word models. This is because the invention
enables a user to add words to the speech
recognition vocabulary in a quick and easy manner,
by building acoustic models from tokens that are
collected during actual use of the system, rather
than during a separate training or enrollment process.
And it lets the user enter the spelling of new
words that are added to the vocabulary with a
minimum of keystrokes. The invention also develops
a language model for use in a speech recognition
system which is customized to a user's particular
dictation, without requiring much additional
effort to obtain such customization. It can also be
seen that the invention provides a system which
enables a user to easily correct errors made in
continuous speech recognition.

Those skilled in the art of computing and 
speech recognition will understand that the inven- 
tion described above, and in the claims that follow, 
can be embodied in many different ways. For ex- 
ample, the embodiments described above are con- 
tained largely in computer programs designed to 
run on general purpose computers such as the IBM 
PC AT and the COMPAQ 386, but it will be clear to 
those skilled in the computer and electronics arts 
that many or all of the basic functions of this 
invention could be performed in special purpose 
hardware. It will also be clear to those skilled in the 
computer arts that the steps of the program described
above, when represented as program
instructions loaded into program memory, represent
means for accomplishing the tasks of their associated
steps, and thus such programs constitute
both means and methods for purposes of a patent
application.

In the embodiment of the invention described 
here, functions are usually selected either by 
pressing keys or speaking a command. It should 
be understood, however, that any other method by 
which the operator may enter commands and char- 
acters into a computer could be used instead. For 
example, a pointing device such as a mouse (or 
cursor keys) may be used to choose items from a 
choice menu or even to choose letters from a 
menu of the alphabet. 

In the embodiment of the invention shown in
FIGS. 25 and 26, the speech recognition system
described was designed to perform continuous
speech recognition of discretely spoken words or
words spoken in brief multiword phrases. It should
be understood however that the basic concepts of
this embodiment could be applied to a system
designed to recognize continuous speech of unlimited
length. In such a case the phrase active window
shown could show the best scoring plurality of
word sequences recognized for the last several
seconds of speech. Either in addition or alternately,
the system could make an acoustic description of
all the speech recognized during a given session,
and use each word recognized to label the portion
of that description associated with that recognized
word. In such an embodiment, the user could return
to any word of the output text produced by the
recognizer which he or she desired to correct, and
the system would enable phrase recognition starting
at the portion of the acoustic description corresponding
to the word to be corrected in the
general manner described with regard to FIGS. 25
and 26.

Those skilled in the computing arts will recognize
that even though the embodiments of the
invention described above were designed to operate
as terminate-and-stay-resident keyboard emulators,
it would be very easy, in alternate embodiments,
to fully integrate the invention into a particular
application program, such as a word processor,
spreadsheet, or any other application into which
text is entered.

It should also be understood that the present
embodiment could be used with speech recognition
systems in which the acoustic word models
are stored, and the speech recognition is performed,
in a distributed manner, such as in a system
using a neural net architecture.

Accordingly, the present invention should not
be considered to be limited by the description
herein of the preferred embodiment, but rather
should be interpreted in accordance with the following
claims:

Claims

1. A system for creating word models comprising:
-means for making an acoustic model from one or
more utterances of a word;
-means for enabling a user to associate a sequence
of textual characters with that acoustic
model, said means including:
-means for indicating to the user a menu of one or
more sequences of textual characters;
-means for enabling the user to select a given
character sequence from the menu;
-means for enabling the user to edit the selected
character sequence to make it represent a different
sequence of characters;
-means for associating said edited character sequence
with said acoustic model.

2. A speech recognition system including the
system for creating word models recited in Claim
1, characterised in that:
-said speech recognition system further includes:
-means for making an acoustic description of a
given portion of speech to be recognised;
-means for temporarily storing said acoustic description;
-means for storing acoustic word models for each
of a plurality of words;
-means for storing a sequence of textual characters
in association with each acoustic word model; and
-recognition means for selecting which one or more
of said acoustic models best match said acoustic
description and for producing a list of those best
matching acoustic models;
-said means for indicating a menu including means
for providing the user with a menu of the character
sequences associated with the best matching
acoustic models on the list produced by the recognition
means; and
-said means for making an acoustic model including
means for using as one of the utterances used
in the making of said acoustic model said acoustic
description used by said recognition means in selecting
the character sequences indicated in said
menu.

3. A speech recognition system including the
system for creating word models recited in Claim
1, characterised in that:
-said system further includes:
-means for making an acoustic description of a
menu selection command to be recognized;
-means for storing acoustic models of a plurality of
menu selection commands each associated with
one of the sequences of textual characters indicated
by said menu; and
-recognition means for selecting which of said
menu selection command models best matches
said acoustic description of said menu selection
command to be recognized; and
-said means for enabling the user to select a given
character sequence from the menu includes means
for using said means for making an acoustic description
of said menu selection command to be
recognized, said means for storing menu selection
command models, and said recognition means to
respond to the user's speaking of a menu selection
command by selecting the character sequence
from the menu corresponding to the best matching
menu selection command model selected by said
recognition means.

4. A speech recognition system including the
system for creating word models recited in Claim
1, characterised in that:
-said system further includes:
-means for making an acoustic description of an
editing command to be recognized;
-means for storing acoustic models of a plurality of
editing commands each associated with a function
for editing a sequence of textual characters;
-recognition means for selecting which one or more
of said editing command models best matches said
acoustic description of said editing command to be
recognized;
-said means for enabling the user to edit the selected
character sequence includes means for using
said means for making an acoustic description
of said editing command to be recognized, said
means for storing editing command models, and
said recognition means to respond to the user's
speaking of an editing command by performing
upon the selected character sequence the editing
function corresponding to the best matching editing
command model selected by said recognition
means.

5. A speech recognition system which includes the
system for creating word models described in Claim 1,
characterised in that said speech recognition system
further includes:
-means for representing a body of text comprised
of one or more words and for representing a word
insertion location relative to said text;
-recognition means for recognizing a spoken word
by selecting a word which matches said spoken
word; and
-means for inserting a representation of either the
word selected by said recognition means or the
character sequence selected by the user into said
body of text at said word insertion location.

6. An adaptive speech recognition method for
recognizing a plurality of spoken words over a
period of time and for improving a set of acoustic
word models used during that recognition, said
method comprising the steps of:
-storing a set of such acoustic word models, with
each such model being stored in association with a
word label;
-forming an acoustic description of the sound of
each spoken word to be recognized;
-attempting to perform automatic speech recognition
upon each such acoustic description by comparing it
against a plurality of said acoustic word models to
select which one or more of said word models best
match it;
-storing each of certain acoustic descriptions in
association with a word label corresponding to the
spoken word represented by said acoustic description,
and, for each such certain acoustic description,
performing this storing after attempting to perform
such recognition upon it and before attempting to
perform such recognition upon others of said acoustic
descriptions;
-associating a given acoustic description with a
word label and updating the acoustic word model
of that label by seeking to find any acoustic
descriptions previously stored in association with that
word label and by merging acoustic data from
those previously stored descriptions with the
acoustic data from the given acoustic description to
make the updated word model, and performing this
storing after attempting to perform such recognition
upon said given acoustic description and before
attempting to perform such recognition upon others
of said acoustic descriptions;
-storing the updated acoustic word model in said
set of word models and using it in the subsequent
performance of said attempted automatic speech
recognition.
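
By way of illustration only, the store-then-merge cycle recited in Claim 6 might be sketched as follows. The sketch assumes that each acoustic description is a fixed-length list of numbers and that merging is a plain per-dimension average; both are simplifying assumptions made for this example, and the class and method names (AdaptiveWordModels, recognize, confirm) are hypothetical rather than taken from the specification.

```python
from collections import defaultdict


class AdaptiveWordModels:
    """Toy store of acoustic word models rebuilt from saved descriptions."""

    def __init__(self):
        # word label -> acoustic descriptions stored under that label
        self.tokens = defaultdict(list)
        # word label -> current acoustic word model
        self.models = {}

    def recognize(self, description):
        """Return word labels ordered from best to worst match."""
        def distance(label):
            model = self.models[label]
            return sum((a - b) ** 2 for a, b in zip(model, description))
        return sorted(self.models, key=distance)

    def confirm(self, description, label):
        """Store a description under its label, then update that label's model."""
        self.tokens[label].append(list(description))
        stored = self.tokens[label]
        dims = len(description)
        # Assumption: merging is a per-dimension mean over every stored
        # description for this label, including the one just added.
        self.models[label] = [
            sum(token[i] for token in stored) / len(stored) for i in range(dims)
        ]
```

In this sketch, confirm would be called after recognition has been attempted on a word and before the next word is processed, so that the updated model is available for subsequent recognition.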

7. A speech recognition system comprising:
-means for making an acoustic description of a 
given portion of speech to be recognized, as spok- 
en by a given group of one or more speakers; 
-means for storing a plurality of individually trained 
acoustic word models, each of which is associated 
with a given word, and each of which is derived 
from acoustic data produced by having one or 
more speakers from said given group speak one or 
more utterances of its associated word;
-means for storing a plurality of phonetic acoustic 
word models, each of which is associated with a 
given word, and none of which are derived from 
acoustic data produced by having any speakers 
from said given group speak its associated word; 
-recognition means for comparing both said individ- 
ually trained and said phonetic acoustic word 
models against said acoustic description of a given 
portion of speech to be recognized and for select- 
ing which one or more of said models best match 
said acoustic description. 
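
As a rough illustration of Claim 7, the recognition means can be thought of as scoring both model sets against the same acoustic description in a single pass. The sketch below assumes each model and description is a fixed-length list of numbers compared by squared distance; the function name best_matches and its parameters are hypothetical.

```python
def best_matches(description, trained_models, phonetic_models, n=2):
    """Score both model sets against one description.

    Returns the n closest candidates as (distance, word, source) tuples,
    closest first.
    """
    candidates = []
    for source, models in (("trained", trained_models),
                           ("phonetic", phonetic_models)):
        for word, model in models.items():
            # Assumption: squared Euclidean distance stands in for the
            # acoustic match score.
            distance = sum((a - b) ** 2 for a, b in zip(model, description))
            candidates.append((distance, word, source))
    return sorted(candidates)[:n]
```

Keeping the source tag alongside each candidate is a choice made for this example only; it lets a caller see whether a match came from an individually trained model or from a phonetic one.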

8. A speech recognition system comprising:
-means for making an acoustic description of a 
given portion of speech to be recognized; 
-means for storing a first, acoustically selectable, 
set of machine responses, each of which is asso- 
ciated with a word; 

-means for storing an acoustic word model of the 
word associated with each of said acoustically 
selectable machine responses; 
-means for storing a second, non-acoustically
selectable, set of machine responses, each of which is
associated with a word;
-recognition means for selecting which one or more
of said acoustic models best match said acoustic
description;
-recognition indicating means for indicating to a
user the corresponding one or more acoustically
selectable machine responses associated with
those best matching models;
-filtering means for selecting a subset of said non- 
acoustically selectable machine responses, said fil- 
tering means making said selection without per- 
forming a match between said acoustic description 
and any acoustic word models; 
-filtering indicating means for indicating to the user 
the non-acoustically selectable machine responses 
selected by the filtering means; 
-means for enabling a user to select one of the 
indicated acoustically selectable or non-acoustically 
selectable machine responses as a desired ma- 
chine response. 



9. A system for enabling a user to create word 
models for use in speech recognition comprising: 

-means for storing a set of machine responses;
-means for enabling a user to enter filtering
information which does not uniquely identify a desired
machine response, but which does specify a
subset of machine responses to which the desired
machine response belongs; and
-filtering means for responding to the entry of such
filtering information by selecting a subset of said
machine responses which is limited to the subset
specified by the filtering information;
-means for indicating to the user one or more machine
responses from the subset selected by the filtering
means;
-means for enabling the user to select which of the
indicated machine responses is the response to be
associated with a word model to be trained, without
requiring the user to enter all the information
contained in the machine response;
-means for making an acoustic description of a
given portion of speech;
-means for incorporating data from that acoustic
description into an acoustic word model associated
with the selected machine response.

10. A speech recognition system which includes
the system for enabling a user to create
word models described in Claim 9, said speech
recognition system further including:
-means for representing a body of text comprised of
one or more words and for representing a word
insertion location relative to said text;
-means for storing a word in association with each
of said machine responses;
-recognition means for recognising a spoken word
by selecting a word which matches said spoken
word; and
-means for inserting a representation of either the
word selected by said recognition means or the
word associated with the indicated machine response
selected by the user into said body of text
at said word insertion location.

11. A system for enabling a user to create
word models for use in speech recognition comprising:
-means for storing a set of machine responses;
-means for storing a word in association with each
of said machine responses;
-language model means for indicating the probability
that a given word to be trained will be each
of a plurality of said stored words based on statistical
information on the frequency of each such
stored word's use;
-filtering means for selecting a subset of said
machine responses based on the probabilities,
indicated by said language model means, of the
stored word associated with each such machine
response;






-means for indicating to the user the one or more
machine responses selected by the filtering means;
-means for enabling the user to select which of the
indicated machine responses is the response to be
associated with a word model to be trained, without
requiring the user to enter all the information
contained in the machine response;
-means for making an acoustic description of a
given portion of speech spoken by the user;
-means for incorporating data from that acoustic
description into an acoustic model associated with
the selected machine response.

12. A continuous speech recognition system 
comprising: 

-means for making an acoustic description of a 
given portion of speech to be recognized; 
-means for storing acoustic models of a plurality of 
words; 

-recognition means for matching sequences of 
acoustic word models against said portion of 
speech and for selecting a plurality of the best 
matching word sequences, each representing a se- 
quence of words whose sequence of corresponding 
acoustic models provides one of the best matches 
against said acoustic description; 
-means for indicating to a user each of said plural- 
ity of best matching word sequences; and 
-means for enabling the user to select one of the 
indicated best matching word sequences for use as 
an output, without requiring the user to enter each 
word in the selected sequence. 

13. A continuous speech recognition system 
comprising:

-means for making an acoustic description of a 
given portion of speech to be recognized; 
-means for storing acoustic models of a plurality of 
words; 

-recognition means for matching sequences of 
acoustic word models against said acoustic de- 
scription and for selecting a best matching word 
sequence, representing a sequence of words
whose sequence of corresponding acoustic models
provides one of the best matches against said
acoustic description;

-means for indicating the words of said best match- 
ing word sequence to a user; 
-means for enabling the user to select an individual 
word from the indicated best matching sequence of 
words; 

-means for enabling the user to correct the in- 
dicated best matching word sequence by correct- 
ing the selected word; 

-means for using the indicated best matching word 
sequence, with the corrected selected word as an 
output. 

14. A speech recognition system comprising: 
-means for storing a list of machine responses,
each of which has associated with it a spelling
comprised of a sequence of characters and a
pronunciation;
-means for storing an acoustic model for the
pronunciation associated with each of the machine
responses;
-means for making an acoustic description of a
portion of speech to be recognized;
-means for enabling a user to enter a string of one
or more characters as filtering information;
-means for enabling the user to edit said string of
characters once it has been entered;
-filtering means for responding to the entry and
editing of said string by selecting a subset of
machine responses associated with spellings which
contain the string of one or more characters as
entered and edited by said user; and
-recognition means for making a filtered selection
of which one or more of said acoustic models best
match said acoustic description of said portion of
speech to be recognized, including means for
causing the selection by said recognition means to
favor the selection of acoustic models whose
associated machine responses are in said subset
selected by said filtering means.
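
The filtered selection of Claim 14 might be illustrated, under stated assumptions, as follows. The sketch assumes that each acoustic model and description is a list of numbers compared by squared distance, and that "favoring" the filtered subset is done by adding a fixed penalty to every word outside it. The function name filtered_recognition and the penalty parameter are hypothetical and are not taken from the specification.

```python
def filtered_recognition(description, vocabulary, typed, penalty=100.0):
    """Rank spellings by acoustic match, favouring those containing `typed`.

    vocabulary maps each spelling to its acoustic model (a list of floats
    in this toy version). Returns spellings ordered best match first.
    """
    # Filtering: spellings that contain the string as entered and edited.
    subset = {spelling for spelling in vocabulary if typed in spelling}
    scored = []
    for spelling, model in vocabulary.items():
        score = sum((a - b) ** 2 for a, b in zip(model, description))
        if spelling not in subset:
            # Assumption: words outside the subset are disfavoured by a
            # fixed penalty rather than excluded outright.
            score += penalty
        scored.append((score, spelling))
    return [spelling for _, spelling in sorted(scored)]
```

Calling the function again each time the user edits the string re-runs the filter, so that typing "the" and then extending it to "thei" narrows the favoured subset accordingly.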

15. A speech recognition system designed to
recognise a series of spoken words, said system
comprising:
-language model means for indicating the probability
that a given word to be recognised will be each
of a plurality of vocabulary words based on statistical
information on the frequency of that word's use;
-means for making an acoustic description of the
utterance of each of said series of spoken words to
be recognised;
-means for storing acoustic models of a plurality of
vocabulary words;
-recognition means for selecting which one or
more of said vocabulary words' acoustic models
best match a given acoustic description of a word
to be recognised, based both on the closeness of
the match between said acoustic models and said
acoustic description and on the probability
indications by said language model means;
-means for using the selection by said recognition
means of one or more vocabulary words as best
matching said given acoustic description to update
the statistical information on the frequency of said
one or more vocabulary words in said language
model means and for causing said recognition
means to use a probability based on said updated
statistical information in the recognition of which
vocabulary words best match acoustic descriptions
of subsequent words.
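
A minimal sketch of the adaptive language model of Claim 15 is given below for illustration. It assumes a unigram model holding a use count per vocabulary word, with every count starting at one, and it assumes that the acoustic match and the language model probability are combined by subtracting the log probability from an acoustic distance. The names AdaptiveUnigram and recognize, and this particular way of combining the two scores, are assumptions made for the example.

```python
import math


class AdaptiveUnigram:
    """Word-frequency language model updated as words are recognised."""

    def __init__(self, vocabulary):
        # Assumption: every word starts with a count of one so that no
        # word has zero probability.
        self.counts = {word: 1 for word in vocabulary}

    def probability(self, word):
        return self.counts[word] / sum(self.counts.values())

    def update(self, word):
        self.counts[word] += 1


def recognize(acoustic_distances, lm):
    """Pick the best word and update the language model with it.

    acoustic_distances maps each word to a distance (lower is closer).
    """
    def cost(word):
        # Lower cost is better: a close acoustic match and a high
        # language model probability both reduce the cost.
        return acoustic_distances[word] - math.log(lm.probability(word))

    best = min(acoustic_distances, key=cost)
    lm.update(best)
    return best
```

With this arrangement a word that has been selected often can win over a slightly closer acoustic match, which is the effect the updated frequency information is intended to have on subsequent words.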
